Abstract

In the sparse linear regression setting, we consider testing the significance of the predictor
variable that enters the current lasso model, in the sequence of models visited along the lasso
solution path. We propose a simple test statistic based on lasso fitted values, called the covariance
test statistic, and show that when the true model is linear, this statistic has an Exp(1)
asymptotic distribution under the null hypothesis (the null being that all truly active variables
are contained in the current lasso model). Our proof of this result for the special case of the
first predictor to enter the model (i.e., testing for a single significant predictor variable against
the global null) requires only weak assumptions on the predictor matrix X. On the other hand,
our proof for a general step in the lasso path places further technical assumptions on X and
the generative model, but still allows for the important high-dimensional case p > n, and does
not necessarily require that the current lasso model achieves perfect recovery of the truly active
variables.

Of course, for testing the significance of an additional variable between two nested linear
models, one typically uses the chi-squared test, comparing the drop in residual sum of squares
(RSS) to a $\chi^2_1$ distribution. But when this additional variable is not fixed, and has been chosen
adaptively or greedily, this test is no longer appropriate: adaptivity makes the drop in RSS
stochastically much larger than $\chi^2_1$ under the null hypothesis. Our analysis explicitly accounts
for adaptivity, as it must, since the lasso builds an adaptive sequence of linear models as the
tuning parameter $\lambda$ decreases. In this analysis, shrinkage plays a key role: though additional
variables are chosen adaptively, the coefficients of lasso active variables are shrunken due to
the $\ell_1$ penalty. Therefore the test statistic (which is based on lasso fitted values) is in a sense
balanced by these two opposing properties (adaptivity and shrinkage), and its null distribution
is tractable and asymptotically Exp(1).
1 Introduction



We consider the usual linear regression setup, for an outcome vector $y \in \mathbb{R}^n$ and matrix of predictor
variables $X \in \mathbb{R}^{n \times p}$:

$$y = X\beta^* + \epsilon, \quad \epsilon \sim N(0, \sigma^2 I), \qquad (1)$$

where $\beta^* \in \mathbb{R}^p$ are unknown coefficients to be estimated. [If an intercept term is desired, then we
can still assume a model of the form (1) after centering y and the columns of X; see Section 2.2 for
more details.] We focus on the lasso estimator (Tibshirani 1996, Chen et al. 1998), defined as

$$\hat\beta = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \; \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1, \qquad (2)$$

where $\lambda \geq 0$ is a tuning parameter, controlling the level of sparsity in $\hat\beta$. Here we assume that the
columns of X are in general position in order to ensure uniqueness of the lasso solution [this is quite
a weak condition, to be discussed again shortly; see also Tibshirani (2012)].

There has been a considerable amount of recent work dedicated to the lasso problem, both in
terms of computation and theory. A comprehensive summary of the literature in either category
would be too long for our purposes here, so we instead give a short summary: for computational
work, some relevant contributions are Friedman et al. (2007), Beck & Teboulle (2009), Friedman
et al. (2010), Becker, Bobin & Candes (2011), Boyd et al. (2011), Becker, Candes & Grant (2011);
and for theoretical work see, e.g., Greenshtein & Ritov (2004), Fuchs (2005), Donoho (2006), Candes
& Tao (2006), Zhao & Yu (2006), Wainwright (2009), Candes & Plan (2009). Generally speaking,
theory for the lasso is focused on bounding the estimation error $\|X\hat\beta - X\beta^*\|_2^2$ or $\|\hat\beta - \beta^*\|_2^2$, or
ensuring exact recovery of the underlying model, $\mathrm{supp}(\hat\beta) = \mathrm{supp}(\beta^*)$ [with $\mathrm{supp}(\cdot)$ denoting the
support function]; favorable results in both respects can be shown under the right assumptions on
the generative model (1) and the predictor matrix X. Strong theoretical backing, as well as fast
algorithms, have made the lasso a highly popular tool.

Yet, there are still major gaps in our understanding of the lasso as an estimation procedure. 
In many real applications of the lasso, a practitioner will undoubtedly seek some sort of inferential 
guarantees for his or her computed lasso model; but, generically, the usual constructs like p-values,
confidence intervals, etc., do not exist for lasso estimates. There is a small but growing literature
dedicated to inference for the lasso, and important progress has certainly been made, mostly through
the use of methods based on resampling or data splitting; we review this work in Section 2.5. The
current paper focuses on a significance test for lasso models that does not employ resampling or data 
splitting, but instead uses the full data set as given, and proposes a test statistic that has a simple 
and exact asymptotic null distribution. 

Section 2 defines the problem that we are trying to solve, and gives the details of our proposal,
the covariance test statistic. Section 3 considers an orthogonal predictor matrix X, in which case the
statistic greatly simplifies. Here we derive its Exp(1) asymptotic distribution using relatively simple
arguments from extreme value theory. Section 4 treats a general (nonorthogonal) X, and under
some regularity conditions, derives an Exp(1) limiting distribution for the covariance test statistic,
but through a different method of proof that relies on discrete-time Gaussian processes. Section 5
empirically verifies convergence of the null distribution to Exp(1) over a variety of problem setups.
Up until this point we have assumed that the error variance $\sigma^2$ is known; in Section 6 we discuss
the case of unknown $\sigma^2$. Section 7 gives some real data examples. Section 8 covers extensions to
the elastic net, generalized linear models, and the Cox model for survival data. We conclude with a
discussion in Section 9.

2 Significance testing in linear modeling 

Classic theory for significance testing in linear regression operates on two fixed nested models. For
example, if $M$ and $M \cup \{j\}$ are fixed subsets of $\{1, \ldots, p\}$, then to test the significance of the jth
predictor in the model (with variables in) $M \cup \{j\}$, one naturally uses the chi-squared test, which
computes the drop in residual sum of squares (RSS) from regression on $M \cup \{j\}$ and $M$,

$$R_j = (\mathrm{RSS}_M - \mathrm{RSS}_{M \cup \{j\}})/\sigma^2, \qquad (3)$$

and compares this to a $\chi^2_1$ distribution. (Here $\sigma^2$ is assumed to be known; when $\sigma^2$ is unknown, we
use the sample variance in its place, which results in the F-test, equivalent to the t-test, for testing
the significance of variable j.)

Often, however, one would like to run the same test for $M$ and $M \cup \{j\}$ that are not fixed, but
the outputs of an adaptive or greedy procedure. Unfortunately, adaptivity invalidates the use of a
$\chi^2_1$ null distribution for the statistic (3). As a simple example, consider forward stepwise regression:
starting with an empty model $M = \emptyset$, we enter predictors one at a time, at each step choosing the
predictor j that gives the largest drop in residual sum of squares. In other words, forward stepwise
regression chooses j at each step in order to maximize $R_j$ in (3), over all $j \notin M$. Since $R_j$ follows
a $\chi^2_1$ distribution under the null hypothesis for each fixed j, the maximum possible $R_j$ will clearly
be stochastically larger than $\chi^2_1$ under the null. Therefore, using a chi-squared test to evaluate the
significance of a predictor entered by forward stepwise regression would be far too liberal (having
type I error much larger than the nominal level). Figure 1(a) demonstrates this point by displaying
the quantiles of $R_1$ in forward stepwise regression (the chi-squared statistic for the first predictor to
enter) versus those of a $\chi^2_1$ variate, in the fully null case (when $\beta^* = 0$). A test at the 5% level, for
example, using the $\chi^2_1$ cutoff of 3.84, would have an actual type I error of about 39%.
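
The size of this inflation can be checked with a small simulation. The following is a minimal sketch rather than the code behind Figure 1: it assumes orthonormal predictors and known $\sigma = 1$, so that under the global null the inner products $X_j^T y$ are i.i.d. N(0, 1) and the first forward stepwise statistic is $R_1 = \max_j (X_j^T y)^2$. The function name, seed, and number of simulations are illustrative choices.

```python
import random

CHI2_1_CUTOFF = 3.84  # approximate 95% quantile of chi-squared(1), as quoted in the text


def first_step_rejection_rate(p=10, n_sims=20000, seed=0):
    """Monte Carlo type I error of the naive chi-squared test at the first
    forward stepwise step, under the global null with orthonormal predictors.

    Assumption: sigma = 1 and X^T X = I, so each X_j^T y is an independent
    N(0, 1) draw and the drop in RSS from adding predictor j is (X_j^T y)^2.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Forward stepwise picks the predictor with the largest drop in RSS.
        r1 = max(rng.gauss(0.0, 1.0) ** 2 for _ in range(p))
        if r1 > CHI2_1_CUTOFF:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    print(first_step_rejection_rate())
```

Under these assumptions the rejection rate also has the closed form $1 - 0.95^p$, roughly 0.40 for p = 10, which is in line with the figure of about 39% reported above.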




Figure 1: A simple example with n = 100 observations and p = 10 orthogonal predictors. All true regression
coefficients are zero, $\beta^* = 0$. On the left is a quantile-quantile plot, constructed over 1000 simulations, of the
standard chi-squared statistic $R_1$ in (3), measuring the drop in residual sum of squares for the first predictor
to enter in forward stepwise regression, versus the $\chi^2_1$ distribution. The dashed vertical line marks the 95%
quantile of the $\chi^2_1$ distribution. The right panel shows a quantile-quantile plot of the covariance test statistic
$T_1$ in (5) for the first predictor to enter in the lasso path, versus its asymptotic null distribution Exp(1). The
covariance test explicitly accounts for the adaptive nature of lasso modeling, whereas the usual chi-squared
test is not appropriate for adaptively selected models, e.g., those produced by forward stepwise regression.



The failure of standard testing methodology when applied to forward stepwise regression is not
an anomaly; in general, there seems to be no direct way to carry out the significance tests designed
for fixed linear models in an adaptive setting.¹ Our aim is hence to provide a (new) significance test
for the predictor variables chosen adaptively by the lasso, which we describe next.



2.1 The covariance test statistic 

The test statistic that we propose here is constructed from the lasso solution path, i.e., the solution
$\hat\beta(\lambda)$ in (2) as a function of the tuning parameter $\lambda \in [0, \infty)$. The lasso path can be computed by the
well-known LARS algorithm of Efron et al. (2004) [see also Osborne et al. (2000a), Osborne et al.
(2000b)], which traces out the solution as $\lambda$ decreases from $\infty$ to 0. Note that when rank(X) < p,
there are possibly many lasso solutions at each $\lambda$ and therefore possibly many solution paths; we
assume that the columns of X are in general position,² implying that there is a unique lasso solution
at each $\lambda > 0$ and hence a unique path. The assumption that X has columns in general position is
a very weak one [much weaker, e.g., than assuming that rank(X) = p]. For example, if the entries of
X are drawn from a continuous probability distribution on $\mathbb{R}^{np}$, then the columns of X are almost
surely in general position, and this is true regardless of the sizes of n and p. See Tibshirani (2012).

¹ It is important to mention that a simple application of sample splitting can yield proper p-values for an adaptive
procedure like forward stepwise: e.g., run forward stepwise regression on one half of the observations to construct a
sequence of models, and use the other half to evaluate significance via the usual chi-squared test. Some of the related
work mentioned in Section 2.5 does essentially this, but with more sophisticated splitting schemes. Our proposal uses
the entire data set as given, and we do not consider sample splitting or resampling techniques. Aside from adding a
layer of complexity, the use of sample splitting can result in a loss of power in significance testing.

Before defining our statistic, we briefly review some properties of the lasso path.

• The path $\hat\beta(\lambda)$ is a continuous and piecewise linear function of $\lambda$, with knots (changes in slope)
at values $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_r \geq 0$ (these knots depend on y, X).

• At $\lambda = \infty$, the solution $\hat\beta(\infty)$ has no active variables (i.e., all variables have zero coefficients);
for decreasing $\lambda$, each knot $\lambda_k$ marks the entry or removal of some variable from the current
active set (i.e., its coefficient becomes nonzero or zero, respectively). Therefore the active set,
and also the signs of active coefficients, remain constant in between knots.

• At any point $\lambda$ in the path, the corresponding active set $A = \mathrm{supp}(\hat\beta(\lambda))$ of the lasso solution
indexes a linearly independent set of predictor variables, i.e., $\mathrm{rank}(X_A) = |A|$, where we use
$X_A$ to denote the columns of X in A.

• For a general X, the number of knots in the lasso path is bounded by $3^p$ (but in practice this
bound is usually very loose). This bound comes from the following realization: if at some knot
$\lambda_k$, the active set is $A = \mathrm{supp}(\hat\beta(\lambda_k))$ and the signs of active coefficients are $s_A = \mathrm{sign}(\hat\beta_A(\lambda_k))$,
then the active set and signs cannot again be $A$ and $s_A$ at some other knot $\lambda_\ell \neq \lambda_k$. This in
particular means that once a variable enters the active set, it cannot immediately leave the
active set at the next step.

• For a matrix X satisfying the positive cone condition (a restrictive condition that covers, e.g.,
orthogonal matrices), there are no variables removed from the active set as $\lambda$ decreases, and
therefore the number of knots is $\min\{n, p\}$.

We can now precisely define the problem that we are trying to solve: at a given step in the lasso
path (i.e., at a given knot), we consider testing the significance of the variable that enters the active
set. To this end, we propose a test statistic defined at the kth step of the path.

First we define some needed quantities. Let $A$ be the active set just before $\lambda_k$, and suppose that
predictor j enters at $\lambda_k$. Denote by $\hat\beta(\lambda_{k+1})$ the solution at the next knot in the path $\lambda_{k+1}$, using
predictors $A \cup \{j\}$. Finally, let $\tilde\beta_A(\lambda_{k+1})$ be the solution of the lasso problem using only the active
predictors $X_A$, at $\lambda = \lambda_{k+1}$. To be perfectly explicit,

$$\tilde\beta_A(\lambda_{k+1}) = \operatorname*{argmin}_{\beta_A \in \mathbb{R}^{|A|}} \; \frac{1}{2}\|y - X_A\beta_A\|_2^2 + \lambda_{k+1}\|\beta_A\|_1. \qquad (4)$$

We propose the covariance test statistic defined by

$$T_k = \big(\langle y, X\hat\beta(\lambda_{k+1})\rangle - \langle y, X_A\tilde\beta_A(\lambda_{k+1})\rangle\big)/\sigma^2. \qquad (5)$$

Intuitively, the covariance statistic in (5) is a function of the difference between $X\hat\beta$ and $X_A\tilde\beta_A$, the
fitted values given by incorporating the jth predictor into the current active set, and leaving it out,
respectively. These fitted values are parametrized by $\lambda$, and so one may ask: at which value of $\lambda$
should this difference be evaluated? Well, note first that $\tilde\beta_A(\lambda_k) = \hat\beta_A(\lambda_k)$, i.e., the solution of the
reduced problem at $\lambda_k$ is simply that of the full problem, restricted to the active set A (as verified
by the KKT conditions). Clearly then, this means that we cannot evaluate the difference at $\lambda = \lambda_k$,
as the jth variable has a zero coefficient upon entry at $\lambda_k$, and hence

$$X\hat\beta(\lambda_k) = X_A\hat\beta_A(\lambda_k) = X_A\tilde\beta_A(\lambda_k).$$

Indeed, the natural choice for the tuning parameter in (5) is $\lambda = \lambda_{k+1}$: this allows the jth coefficient
to have its fullest effect on the fit $X\hat\beta$ before the entry of the next variable at $\lambda_{k+1}$ (or possibly, the
deletion of a variable from A at $\lambda_{k+1}$).

² Points $X_1, \ldots, X_p \in \mathbb{R}^n$ are said to be in general position provided that no k-dimensional affine subspace $L \subseteq \mathbb{R}^n$,
$k < \min\{n, p\}$, contains more than k + 1 elements of $\{\pm X_1, \ldots, \pm X_p\}$, excluding antipodal pairs. Equivalently: the
affine span of any k + 1 points $s_1 X_{i_1}, \ldots, s_{k+1} X_{i_{k+1}}$, for any signs $s_1, \ldots, s_{k+1} \in \{-1, 1\}$, does not contain any element
of the set $\{\pm X_i : i \neq i_1, \ldots, i_{k+1}\}$.

Secondly, one may also ask about the particular choice of function of $X\hat\beta(\lambda_{k+1}) - X_A\tilde\beta_A(\lambda_{k+1})$.
The covariance statistic in (5) uses an inner product of this difference with y, which can be roughly
thought of as an (uncentered) covariance, hence explaining its name.³ At a high level, the larger the
covariance of y with $X\hat\beta$ compared to that with $X_A\tilde\beta_A$, the more important the role of variable j in
the proposed model $A \cup \{j\}$. There certainly may be other functions that would seem appropriate
here, but the covariance form in (5) has a distinctive advantage: this statistic admits a simple and
exact asymptotic null distribution. In Sections 3 and 4 we show that under the null hypothesis that
the current lasso model contains all truly active variables, $A \supseteq \mathrm{supp}(\beta^*)$,

$$T_k \overset{d}{\to} \mathrm{Exp}(1),$$

i.e., $T_k$ is asymptotically distributed as a standard exponential random variable, given reasonable
assumptions on X and the magnitudes of the nonzero true coefficients. [In some cases, e.g., when we
have a strict inclusion $A \supsetneq \mathrm{supp}(\beta^*)$, the use of an Exp(1) null distribution is actually conservative,
because the limiting distribution of $T_k$ is stochastically smaller than Exp(1).] In the above limit, we
are considering both $n, p \to \infty$; in Section 4 we allow for the possibility p > n, the high-dimensional
case.



See Figure 1(b) for a quantile-quantile plot of $T_1$ versus an Exp(1) variate for the same fully null
example ($\beta^* = 0$) used in Figure 1(a); this shows that the weak convergence to Exp(1) can be quite
fast, as the quantiles are decently matched even for p = 10. Before proving this limiting distribution
in Sections 3 (for an orthogonal X) and 4 (for a general X), we give an example of its application to
real data, and discuss issues related to practical usage. We also derive useful alternative expressions
for the statistic, discuss the connection to degrees of freedom, and review related work.
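
A rough numerical check of this behavior is sketched below. It is an illustration under stated assumptions, not the simulation behind Figure 1(b): it takes orthonormal predictors, known $\sigma = 1$ and the global null, and uses the orthogonal-design form of the first-step statistic derived in Section 3, $T_1 = \lambda_1(\lambda_1 - \lambda_2)/\sigma^2$, where $\lambda_1 \geq \lambda_2$ are the two largest values of $|X_j^T y|$. The function names are hypothetical, and the p-value is taken as the upper tail of Exp(1), i.e., $e^{-T_1}$.

```python
import math
import random


def covariance_stat_first_step(u, sigma=1.0):
    """T_1 = lambda_1 * (lambda_1 - lambda_2) / sigma^2 for orthonormal X,
    where lambda_1 >= lambda_2 are the two largest values of |X_j^T y|."""
    top = sorted((abs(v) for v in u), reverse=True)
    return top[0] * (top[0] - top[1]) / sigma ** 2


def simulate_null_t1(p=10, n_sims=20000, seed=0):
    """Draw T_1 under the global null (beta* = 0, sigma = 1, X^T X = I),
    in which case the inner products X_j^T y are i.i.d. N(0, 1)."""
    rng = random.Random(seed)
    return [
        covariance_stat_first_step([rng.gauss(0.0, 1.0) for _ in range(p)])
        for _ in range(n_sims)
    ]


def exp1_p_value(t):
    """Upper-tail probability of Exp(1), used as the asymptotic p-value."""
    return math.exp(-t)


if __name__ == "__main__":
    draws = simulate_null_t1()
    mean = sum(draws) / len(draws)
    rate = sum(exp1_p_value(t) < 0.05 for t in draws) / len(draws)
    print(f"mean of T_1: {mean:.3f} (Exp(1) mean is 1)")
    print(f"rejection rate at nominal 5%: {rate:.3f}")
```

If the Exp(1) approximation is adequate at p = 10, the sample mean should be close to 1 and the rejection rate close to the nominal 5%, in contrast to the chi-squared test applied to forward stepwise regression.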



2.2 Prostate cancer data example and practical issues 

We consider a training set of 67 observations and 8 predictors, the goal being to predict log of the 
PSA level of men who had surgery for prostate cancer. For more details see Hastie et al. (2008) and 
the references therein. Table 1 shows the results of forward stepwise regression and the lasso. Both
methods entered the same predictors in the same order. The forward stepwise p-values are smaller 
than the lasso p-values, and would enter four predictors at level 0.05. The latter would enter only 
one or maybe two predictors. However we know that the forward stepwise p-values are inaccurate, 
as they are based on a null distribution that does not account for the adaptive choice of predictors. 
We now make several remarks. 

Table 1: Forward stepwise and lasso applied to the prostate cancer data example. The error variance is
estimated by $\hat\sigma^2$, the MSE of the full model. Forward stepwise regression p-values are based on comparing
the drop in residual sum of squares (divided by $\hat\sigma^2$) to an $F(1, n - p)$ distribution (using $\chi^2_1$ instead produced
slightly smaller p-values). The lasso p-values use a simple modification of the covariance test (5) for unknown
variance, given in Section 6. All p-values are rounded to 3 decimal places.

| Step | Predictor entered | Forward stepwise | Lasso |
|------|-------------------|------------------|-------|
| 1    | lcavol            | 0.000            | 0.000 |
| 2    | lweight           | 0.000            | 0.052 |
| 3    | svi               | 0.041            | 0.174 |
| 4    | lbph              | 0.045            | 0.929 |
| 5    | pgg45             | 0.226            | 0.353 |
| 6    | age               | 0.191            | 0.650 |
| 7    | lcp               | 0.065            | 0.051 |
| 8    | gleason           | 0.883            | 0.978 |

³ From its definition in (5), we get $\sigma^2 T_k = \langle y - \mu, X\hat\beta(\lambda_{k+1})\rangle - \langle y - \mu, X_A\tilde\beta_A(\lambda_{k+1})\rangle + \langle \mu, X\hat\beta(\lambda_{k+1}) - X_A\tilde\beta_A(\lambda_{k+1})\rangle$
by expanding $y = y - \mu + \mu$, with $\mu = X\beta^*$ denoting the true mean. The first two terms are now really empirical
covariances, and the last term is typically small. In fact, when X is orthogonal, it is not hard to see that this last
term is exactly zero under the null hypothesis.

Remark 1. The above example implicitly assumed that one might stop entering variables into the
model when the computed p-value rose above some threshold. More generally, our proposed test
statistic and associated p-values could be used as the basis for multiple testing and false discovery
rate control methods for this problem; we leave this to future work.

Remark 2. In the example, the lasso entered a predictor into the active set at each step. For a 
general X, however, a given predictor variable may enter the active set more than once along the 
lasso path, since it may leave the active set at some point. In this case we treat each entry as a 
separate problem. Therefore, our test is specific to a step in the path, and not to a predictor variable 
at large. 

Remark 3. For the prostate cancer data set, it is important to include an intercept in the model.
To accommodate this, we ran the lasso on centered y and column-centered X (which is equivalent to
including an unpenalized intercept term in the lasso criterion), and then applied the covariance test
(with the centered data). In general, centering y and the columns of X allows us to account for the
effect of an intercept term, and still use a model of the form (1). From a theoretical perspective,
this centering step creates a weak dependence between the components of the error vector $\epsilon \in \mathbb{R}^n$.
If originally we assumed i.i.d. errors, $\epsilon_i \sim N(0, \sigma^2)$, then after centering y and the columns of X,
our new errors are of the form $\tilde\epsilon_i = \epsilon_i - \bar\epsilon$, where $\bar\epsilon = \sum_{j=1}^n \epsilon_j / n$. It is easy to see that these new errors
are correlated:

$$\mathrm{Cov}(\tilde\epsilon_i, \tilde\epsilon_j) = -\sigma^2/n \quad \text{for } i \neq j.$$
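
For completeness, this covariance follows from bilinearity. Writing $\tilde\epsilon_i = \epsilon_i - \bar\epsilon$ for the centered errors, and using only that the original errors are uncorrelated with common variance $\sigma^2$ (so that $\mathrm{Cov}(\epsilon_i, \bar\epsilon) = \sigma^2/n$ and $\mathrm{Var}(\bar\epsilon) = \sigma^2/n$), we have for $i \neq j$:

```latex
\begin{align*}
\mathrm{Cov}(\tilde\epsilon_i, \tilde\epsilon_j)
  &= \mathrm{Cov}(\epsilon_i - \bar\epsilon,\; \epsilon_j - \bar\epsilon) \\
  &= \mathrm{Cov}(\epsilon_i, \epsilon_j) - \mathrm{Cov}(\epsilon_i, \bar\epsilon)
     - \mathrm{Cov}(\bar\epsilon, \epsilon_j) + \mathrm{Var}(\bar\epsilon) \\
  &= 0 - \frac{\sigma^2}{n} - \frac{\sigma^2}{n} + \frac{\sigma^2}{n}
   = -\frac{\sigma^2}{n}.
\end{align*}
```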

One might imagine that such correlation would cause problems for our theory in Sections 3 and 4,
which assumes i.i.d. normal errors in the model (1). However, a careful look at the arguments in
these sections reveals that the only dependence on y is through $X^T y$, the inner products of y with
the columns of X. Furthermore,

$$\mathrm{Cov}(X_i^T \tilde\epsilon, X_j^T \tilde\epsilon) = \sigma^2 X_i^T \Big(I - \frac{1}{n}\mathbb{1}\mathbb{1}^T\Big) X_j = \sigma^2 X_i^T X_j \quad \text{for all } i, j,$$

which is the same as it would have been without centering (here $\mathbb{1}\mathbb{1}^T$ is the matrix of all 1s, and
we used that the columns of X are centered). Therefore, our arguments in Sections 3 and 4 apply
equally well to centered data, and centering has no effect on the asymptotic distribution of $T_k$.

Remark 4. By design, the covariance test is applied in a sequential manner, estimating p-values for
each predictor variable as it enters the model along the lasso path. A more difficult problem is to
test the significance of any of the active predictors in a model fit by the lasso, at some arbitrary
value of the tuning parameter $\lambda$. We discuss this problem briefly in Section 9.

2.3 Alternate expressions for the covariance statistic

Here we derive two alternate forms for the covariance statistic in (5). The first lends some insight
into the role of shrinkage, and the second is helpful for the convergence results that we establish in
Sections 3 and 4. We rely on some basic properties of lasso solutions; see, e.g., Tibshirani & Taylor
(2012), Tibshirani (2012). To remind the reader, we are assuming that X has columns in general
position.

For any fixed $\lambda$, if the lasso solution has active set $A = \mathrm{supp}(\hat\beta(\lambda))$ and signs $s_A = \mathrm{sign}(\hat\beta_A(\lambda))$,
then it can be written explicitly (over active variables) as

$$\hat\beta_A(\lambda) = (X_A^T X_A)^{-1} X_A^T y - \lambda (X_A^T X_A)^{-1} s_A.$$

In the above expression, the first term $(X_A^T X_A)^{-1} X_A^T y$ simply gives the regression coefficients of y
on the active variables $X_A$, and the second term $-\lambda (X_A^T X_A)^{-1} s_A$ can be thought of as a shrinkage
term, shrinking the values of these coefficients towards zero. Further, the lasso fitted value at $\lambda$ is

$$X\hat\beta(\lambda) = P_A y - \lambda (X_A^T)^+ s_A, \qquad (6)$$

where $P_A = X_A (X_A^T X_A)^{-1} X_A^T$ denotes the projection onto the column space of $X_A$, and $(X_A^T)^+ =
X_A (X_A^T X_A)^{-1}$ is the (Moore-Penrose) pseudoinverse of $X_A^T$.

Using the representation (6) for the fitted values, we can derive our first alternate expression for
the covariance statistic in (5). If $A$ and $s_A$ are the active set and signs just before the knot $\lambda_k$, and
j is the variable added to the active set at $\lambda_k$, with sign s upon entry, then by (6),

$$X\hat\beta(\lambda_{k+1}) = P_{A \cup \{j\}} y - \lambda_{k+1} (X_{A \cup \{j\}}^T)^+ s_{A \cup \{j\}},$$

where $s_{A \cup \{j\}} = \mathrm{sign}(\hat\beta_{A \cup \{j\}}(\lambda_{k+1}))$. We can equivalently write $s_{A \cup \{j\}} = (s_A, s)$, the concatenation
of $s_A$ and the sign s of the jth coefficient when it entered (as no sign changes could have occurred
inside of the interval $[\lambda_{k+1}, \lambda_k]$, by definition of the knots). Let us assume for the moment that the
solution of the reduced lasso problem (4) at $\lambda_{k+1}$ has all variables active and $s_A = \mathrm{sign}(\tilde\beta_A(\lambda_{k+1}))$;
remember, this holds for the reduced problem at $\lambda_k$, and we will return to this assumption shortly.
Then, again by (6),

$$X_A \tilde\beta_A(\lambda_{k+1}) = P_A y - \lambda_{k+1} (X_A^T)^+ s_A,$$

and plugging the above two expressions into (5),

$$T_k = y^T (P_{A \cup \{j\}} - P_A) y / \sigma^2 - \lambda_{k+1} \cdot y^T \big( (X_{A \cup \{j\}}^T)^+ s_{A \cup \{j\}} - (X_A^T)^+ s_A \big) / \sigma^2. \qquad (7)$$

Note that the first term above is $y^T (P_{A \cup \{j\}} - P_A) y / \sigma^2 = (\|y - P_A y\|_2^2 - \|y - P_{A \cup \{j\}} y\|_2^2)/\sigma^2$, which
is exactly the chi-squared statistic for testing the significance of variable j, as in (3). Hence if A, j
were fixed, then without the second term, $T_k$ would have a $\chi^2_1$ distribution under the null. But of
course A, j are not fixed, and so much like we saw previously with forward stepwise regression, the
first term in (7) will be generically larger than $\chi^2_1$, because j is chosen adaptively based on its inner
product with the current lasso residual vector. Interestingly, the second term in (7) adjusts for this
adaptivity: with this term, which is composed of the shrinkage factors in the solutions of the two
relevant lasso problems (on X and $X_A$), we prove in the coming sections that $T_k$ has an asymptotic
Exp(1) null distribution. Therefore, the presence of the second term restores the (asymptotic) mean
of $T_k$ to 1, which is what it would have been if A, j were fixed and the second term were missing.
In short, adaptivity and shrinkage balance each other out.

This insight aside, the form (7) of the covariance statistic leads to a second representation that
will be useful for the theoretical work in Sections 3 and 4. We call this the knot form of the covariance
statistic, described in the next lemma.

Lemma 1. Let $A$ be the active set just before the kth step in the lasso path, i.e., $A = \mathrm{supp}(\hat\beta(\lambda_k))$,
with $\lambda_k$ being the kth knot. Also let $s_A$ denote the signs of the active coefficients, $s_A = \mathrm{sign}(\hat\beta_A(\lambda_k))$,
j be the predictor that enters the active set at $\lambda_k$, and s be its sign upon entry. Then, assuming that

$$s_A = \mathrm{sign}(\tilde\beta_A(\lambda_{k+1})), \qquad (8)$$

or in other words, all coefficients are active in the reduced lasso problem (4) at $\lambda_{k+1}$ and have signs
$s_A$, we have

$$T_k = C(A, s_A, j, s) \cdot \lambda_k (\lambda_k - \lambda_{k+1}) / \sigma^2, \qquad (9)$$

where

$$C(A, s_A, j, s) = \big\| (X_{A \cup \{j\}}^T)^+ s_{A \cup \{j\}} - (X_A^T)^+ s_A \big\|_2^2,$$

and $s_{A \cup \{j\}}$ is the concatenation of $s_A$ and s.

The proof starts with expression (7), and arrives at (9) through simple algebraic manipulations.
We defer it until Appendix A.1.

When does the condition (8) hold? This was a key assumption behind both of the forms (7) and
(9) for the statistic. We first note that the solution $\tilde\beta_A$ of the reduced lasso problem has signs $s_A$ at
$\lambda_k$, so it will have the same signs $s_A$ at $\lambda_{k+1}$ provided that no variables are deleted from the active
set in the solution path $\tilde\beta_A(\lambda)$ for $\lambda \in [\lambda_{k+1}, \lambda_k]$. Therefore, assumption (8) holds:

1. When X satisfies the positive cone condition (which includes X orthogonal), because no variables
ever leave the active set in this case. In fact, for X orthogonal, it is straightforward to
check that $C(A, s_A, j, s) = 1$, so $T_k = \lambda_k(\lambda_k - \lambda_{k+1})/\sigma^2$.

2. When k = 1 (we are testing the first variable to enter), as a variable cannot leave the active
set right after it has entered. If k = 1 and X has unit norm columns, $\|X_i\|_2 = 1$ for $i = 1, \ldots, p$,
then we again have $C(A, s_A, j, s) = 1$ (note that $A = \emptyset$), so $T_1 = \lambda_1(\lambda_1 - \lambda_2)/\sigma^2$.

3. When $s_A = \mathrm{sign}((X_A)^+ y)$, i.e., $s_A$ contains the signs of the least squares coefficients on $X_A$,
because the same active set and signs cannot appear at two different knots in the lasso path
(applied here to the reduced lasso problem on $X_A$).

The first and second scenarios are considered in Sections 3 and 4.1, respectively. The third scenario
is actually somewhat general and occurs, e.g., when $s_A = \mathrm{sign}((X_A)^+ y) = \mathrm{sign}(\beta^*_A)$, i.e., both the lasso
and least squares on $X_A$ recover the signs of the true coefficients. Section 4.2 studies the general X
and k > 1 case, wherein this third scenario is important.
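
The two claims that $C(A, s_A, j, s) = 1$ in the first and second scenarios can be verified directly from the definition of C in Lemma 1. The short derivation below is a sketch added for the reader, with $X_j$ denoting the entering column and $s \in \{-1, 1\}$ its sign; it uses only the pseudoinverse formula $(X_B^T)^+ = X_B (X_B^T X_B)^{-1}$ for an index set B with linearly independent columns.

```latex
% Orthogonal X (X^T X = I): (X_B^T)^+ = X_B for any index set B, so
\begin{align*}
(X_{A \cup \{j\}}^T)^+ s_{A \cup \{j\}} - (X_A^T)^+ s_A
  &= (X_A s_A + s X_j) - X_A s_A = s X_j, \\
C(A, s_A, j, s) &= \|s X_j\|_2^2 = s^2 \|X_j\|_2^2 = 1.
\end{align*}

% First step (k = 1, A = \emptyset) with unit norm columns: the term involving A vanishes, so
\begin{align*}
C(\emptyset, \cdot, j, s)
  = \big\| X_j (X_j^T X_j)^{-1} s \big\|_2^2
  = \frac{s^2 \|X_j\|_2^2}{\|X_j\|_2^4}
  = \frac{1}{\|X_j\|_2^2} = 1.
\end{align*}
```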



2.4 Connection to degrees of freedom 

There is an interesting connection between the covariance statistic in (5) and the degrees of freedom
of a fitting procedure. In the regression setting (1), for an estimate $\hat y$ [which we think of as a fitting
procedure $\hat y = \hat y(y)$], its degrees of freedom is typically defined (Efron 1986) as

$$\mathrm{df}(\hat y) = \frac{1}{\sigma^2} \sum_{i=1}^n \mathrm{Cov}(y_i, \hat y_i). \qquad (10)$$

In words, $\mathrm{df}(\hat y)$ sums the covariances of each observation $y_i$ with its fitted value $\hat y_i$. Hence the more
adaptive a fitting procedure, the higher this covariance, and the greater its degrees of freedom. The
covariance test evaluates the significance of adding the jth predictor via something loosely like a
sample version of degrees of freedom, across two models: that fit on $A \cup \{j\}$, and that on A. This
was more or less the inspiration for the current work.

Using the definition (10), one can reason [and confirm by simulation, just as in Figure 1(a)] that
with k predictors entered into the model, forward stepwise regression has used substantially more
than k degrees of freedom. But something quite remarkable happens when we consider the lasso:
for a model containing k nonzero coefficients, the degrees of freedom of the lasso fit is equal to k
(either exactly or in expectation, depending on the assumptions) [Efron et al. (2004), Zou et al.
(2007), Tibshirani & Taylor (2012)]. Why does this happen? Roughly speaking, it is the same
adaptivity versus shrinkage phenomenon at play. [Recall our discussion in the last section following
the expression (7) for the covariance statistic.] The lasso adaptively chooses the active predictors,
which costs extra degrees of freedom; but it also shrinks the nonzero coefficients (relative to the
usual least squares estimates), which decreases the degrees of freedom just the right amount, so that
the total is simply k.
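
The balance between adaptivity and shrinkage can be illustrated numerically in the simplest setting. The sketch below is an illustration under stated assumptions rather than anything taken from the paper: it assumes orthonormal predictors, $\sigma = 1$ and the global null, so that definition (10) reduces coordinatewise to $E[U \hat\beta(U)]$ with U ~ N(0, 1). It compares soft thresholding at a fixed $\lambda$ (the lasso in this setting) with hard thresholding at the same level, which selects without shrinking and is used here only as a simple stand-in for a greedy procedure. All function names are hypothetical.

```python
import random


def soft_threshold(x, lam):
    """Lasso coefficient for an orthonormal design: shrink toward zero by lam."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def hard_threshold(x, lam):
    """Keep x unchanged if it passes the threshold (selection, no shrinkage)."""
    return x if abs(x) > lam else 0.0


def df_per_coordinate(estimator, lam=1.0, n_sims=200000, seed=0):
    """Monte Carlo estimate of Cov(U, estimator(U)) for U ~ N(0, 1), together
    with the probability that the resulting coefficient is nonzero."""
    rng = random.Random(seed)
    cov_sum = 0.0
    nonzero = 0
    for _ in range(n_sims):
        u = rng.gauss(0.0, 1.0)
        b = estimator(u, lam)
        cov_sum += u * b  # E[U] = 0, so E[U * b] equals the covariance
        nonzero += b != 0.0
    return cov_sum / n_sims, nonzero / n_sims


if __name__ == "__main__":
    for name, est in [("soft (lasso)", soft_threshold), ("hard", hard_threshold)]:
        df, frac = df_per_coordinate(est)
        print(f"{name}: df per coordinate = {df:.3f}, P(nonzero) = {frac:.3f}")
```

In this setting Stein's lemma gives a covariance of exactly $P(|U| > \lambda)$ for soft thresholding, so the degrees of freedom match the expected number of nonzero coefficients, whereas hard thresholding at the same level yields a strictly larger value.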

2.5 Related work 

There is quite a lot of recent work related to the proposal of this paper. Wasserman & Roeder (2009)
propose a procedure for variable selection and p-value estimation in high-dimensional linear models
based on sample splitting, and this idea was extended by Meinshausen et al. (2009). Meinshausen
& Bühlmann (2010) propose a generic method using resampling called "stability selection", which
controls the expected number of false positive variable selections. Minnier et al. (2011) use perturbation
resampling-based procedures to approximate the distribution of a general class of penalized
parameter estimates. One big difference with the work here: we propose a statistic that utilizes the
data as given and does not employ any resampling or sample splitting.

Zhang & Zhang (2011) derive confidence intervals for contrasts of high-dimensional regression
coefficients, by replacing the usual score vector with the residual from a relaxed projection (i.e., the
residual from sparse linear regression). Bühlmann (2012) constructs p-values for coefficients in high-dimensional
regression models, starting with ridge estimation and then employing a bias correction
term that uses the lasso. Unlike these works, our proposal is based on a test statistic with an exact
asymptotic null distribution. Javanmard & Montanari (2013) also give a simple statistic of lasso
coefficients with an exact asymptotic distribution (in fact, their statistic is asymptotically normal),
but they do so for the special case that the predictor matrix X has i.i.d. Gaussian rows.

3 An orthogonal predictor matrix X 

We examine the special case of an orthogonal predictor matrix X, i.e., one that satisfies $X^T X = I$.
Even though the results here can be seen as special cases of those for a general X in Section 4, the
arguments in the current orthogonal X case rely on relatively straightforward extreme value theory
and are hence much simpler than their general X counterparts (which analyze the knots in the lasso
path via Gaussian process theory). Furthermore, the Exp(1) limiting distribution for the covariance
statistic translates in the orthogonal case to a few interesting and previously unknown (as far as we
can tell) results on the order statistics of independent standard $\chi_1$ variates. For these reasons, we
discuss the orthogonal X case in detail.

As noted in the discussion following Lemma 1 (see the first point), for an orthogonal X, we know
that the covariance statistic for testing the entry of the variable at step k in the lasso path is

$$T_k = \lambda_k(\lambda_k - \lambda_{k+1})/\sigma^2.$$

Again using orthogonality, we rewrite $\|y - X\beta\|_2^2 = \|X^T y - \beta\|_2^2 + C$ for a constant C (not depending
on $\beta$) in the lasso criterion, and then we can see that the lasso solution at any given value of $\lambda$ has
the closed form

$$\hat\beta_j(\lambda) = S_\lambda(X_j^T y), \quad j = 1, \ldots p,$$

where $X_1, \ldots X_p$ are the columns of X, and $S_\lambda : \mathbb{R} \to \mathbb{R}$ is the soft-thresholding function,

$$S_\lambda(x) = \begin{cases} x - \lambda & \text{if } x > \lambda \\ 0 & \text{if } -\lambda \le x \le \lambda \\ x + \lambda & \text{if } x < -\lambda. \end{cases}$$

Letting $U_j = X_j^T y$, $j = 1, \ldots p$, the knots in the lasso path are simply the values of $\lambda$ at which the
coefficients become nonzero (i.e., cease to be thresholded),

$$\lambda_1 = |U_{(1)}|, \quad \lambda_2 = |U_{(2)}|, \quad \ldots \quad \lambda_p = |U_{(p)}|,$$

where $|U_{(1)}| \ge |U_{(2)}| \ge \ldots \ge |U_{(p)}|$ are the order statistics of $|U_1|, \ldots |U_p|$ (somewhat of an abuse of
notation). Therefore,

$$T_k = |U_{(k)}|(|U_{(k)}| - |U_{(k+1)}|)/\sigma^2.$$
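
To make the closed form concrete, the following minimal Python sketch (an illustration added here, not code from the original analysis) evaluates the soft-thresholded lasso solution, the knots, and the statistics $T_k$ directly from the inner products $U_j = X_j^T y$. The function names and the example vector `U` are hypothetical; X is assumed orthogonal and $\sigma$ is assumed known.

```python
def soft_threshold(x, lam):
    """Soft-thresholding S_lam(x): shrink x toward zero by lam."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def lasso_orthogonal(U, lam):
    """Lasso coefficients when X^T X = I, given U_j = X_j^T y."""
    return [soft_threshold(u, lam) for u in U]


def covariance_stats(U, sigma=1.0):
    """Knots lambda_k = |U_(k)| and T_k = lambda_k (lambda_k - lambda_{k+1}) / sigma^2.

    Only k = 1, ..., p - 1 are returned; the last step would need a
    convention for lambda_{p+1}, which this sketch does not assume.
    """
    knots = sorted((abs(u) for u in U), reverse=True)
    stats = [knots[k] * (knots[k] - knots[k + 1]) / sigma**2
             for k in range(len(knots) - 1)]
    return knots, stats


U = [0.5, -3.0, 2.0, -1.0]          # hypothetical inner products X_j^T y
knots, stats = covariance_stats(U)
print(knots)                         # [3.0, 2.0, 1.0, 0.5]
print(stats)                         # [3.0, 2.0, 0.5]
print(lasso_orthogonal(U, 1.5))      # [0.0, -1.5, 0.5, 0.0]
```

At $\lambda = 1.5$ only the two coefficients with $|U_j| > 1.5$ are nonzero, in agreement with the knots at 3 and 2.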

Next, we study the special case k = 1, the test for the first predictor to enter the active set along 
the lasso path. We then examine the case k > 1, the test at a general step in the lasso path. 

3.1 The first step, k = 1 

Consider the covariance test statistic for the first predictor to enter the active set, i.e., for k = 1, 

$$T_1 = |U_{(1)}|(|U_{(1)}| - |U_{(2)}|)/\sigma^2.$$

We are interested in the distribution of $T_1$ under the null hypothesis; since we are testing the first
predictor to enter, this is

$$H_0 : y \sim N(0, \sigma^2 I).$$

Under the null, $U_1, \ldots U_p$ are i.i.d., $U_j \sim N(0, \sigma^2)$, and so $|U_1|/\sigma, \ldots |U_p|/\sigma$ follow a $\chi_1$ distribution
(absolute value of a standard Gaussian). That $T_1$ has an asymptotic Exp(1) null distribution is now
given by the next result.

Lemma 2. Let $V_1 \ge V_2 \ge \ldots \ge V_p$ be the order statistics of an independent sample of $\chi_1$ variates
(i.e., they are the sorted absolute values of an independent sample of standard Gaussian variates).
Then

$$V_1(V_1 - V_2) \overset{d}{\to} \mathrm{Exp}(1) \quad \text{as } p \to \infty.$$

Proof. The $\chi_1$ distribution has CDF

$$F(x) = (2\Phi(x) - 1) \, 1(x > 0),$$

where $\Phi$ is the standard normal CDF. We first compute

$$\lim_{t \to \infty} \frac{F''(t)(1 - F(t))}{(F'(t))^2} = -\lim_{t \to \infty} \frac{t(1 - \Phi(t))}{\phi(t)} = -1,$$

the last equality using Mills' ratio. Then Theorem 2.2.1 in de Haan & Ferreira (2006) implies that,
for constants $a_p = F^{-1}(1 - 1/p)$ and $b_p = pF'(a_p)$, the random variables $W_1 = b_p(V_1 - a_p)$ and
$W_2 = b_p(V_2 - a_p)$ converge jointly in distribution,

$$(W_1, W_2) \overset{d}{\to} (-\log E_1, \, -\log(E_1 + E_2)),$$

where $E_1, E_2$ are independent standard exponentials. Now note that

$$V_1(V_1 - V_2) = (a_p + W_1/b_p)(W_1 - W_2)/b_p = \frac{a_p}{b_p}(W_1 - W_2) + \frac{W_1(W_1 - W_2)}{b_p^2}.$$

We claim that $a_p/b_p \to 1$; this would give the desired result, as it would imply that the first term above
converges in distribution to $\log(E_2 + E_1) - \log(E_1)$, which is standard exponential, and the second
term converges to zero, as $b_p \to \infty$. Writing $a_p, b_p$ more explicitly, we see that $1 - 1/p = 2\Phi(a_p) - 1$,
i.e., $1 - \Phi(a_p) = 1/(2p)$, and $b_p = 2p\phi(a_p)$, where $\phi$ is the standard normal density. Using Mills' inequalities,

$$\frac{\phi(a_p)}{a_p} \cdot \frac{1}{1 + 1/a_p^2} \le 1 - \Phi(a_p) \le \frac{\phi(a_p)}{a_p},$$

and multiplying by 2p,

$$\frac{b_p}{a_p} \cdot \frac{1}{1 + 1/a_p^2} \le 1 \le \frac{b_p}{a_p}.$$

Since $a_p \to \infty$, this means that $b_p/a_p \to 1$, completing the proof. □
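
The final step of the proof can be checked numerically. The sketch below (an added illustration using only the Python standard library; the helper name `norming_constants` is hypothetical) computes $a_p$ and $b_p$ from their defining relations $1 - \Phi(a_p) = 1/(2p)$ and $b_p = 2p\phi(a_p)$, and prints the ratio $b_p/a_p$ next to the upper bound $1 + 1/a_p^2$ implied by Mills' inequalities.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def norming_constants(p):
    """Return (a_p, b_p) for the chi_1 CDF F(x) = 2 Phi(x) - 1."""
    # 1 - Phi(a_p) = 1/(2p); by symmetry a_p = -Phi^{-1}(1/(2p)),
    # which avoids evaluating the quantile function very close to 1.
    a_p = -STD.inv_cdf(1.0 / (2.0 * p))
    b_p = 2.0 * p * STD.pdf(a_p)
    return a_p, b_p


for p in (10**2, 10**4, 10**6, 10**8):
    a_p, b_p = norming_constants(p)
    # Mills' inequalities give 1 <= b_p / a_p <= 1 + 1 / a_p^2
    print(p, round(a_p, 4), round(b_p / a_p, 4), round(1 + 1 / a_p**2, 4))
```

The ratio approaches one only at the rate $1/a_p^2$, i.e., roughly like $1/(2 \log p)$, so the convergence in Lemma 2 should be expected to be slow in p.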

We were unable to find this remarkably simple result elsewhere in the literature. An easy gener- 
alization is as follows. 

Lemma 3. If $V_1 \ge V_2 \ge \ldots \ge V_p$ are the order statistics of an independent sample of $\chi_1$ variates,
then for any fixed $k \ge 1$,

$$(V_1(V_1 - V_2), \, V_2(V_2 - V_3), \, \ldots \, V_k(V_k - V_{k+1})) \overset{d}{\to} (\mathrm{Exp}(1), \mathrm{Exp}(1/2), \ldots \mathrm{Exp}(1/k)) \quad \text{as } p \to \infty,$$

where the limiting distribution (on the right-hand side above) has independent components.

To be perfectly clear, here and throughout we use Exp(a) to denote the exponential distribution
with scale parameter a (not rate parameter a), so that if Z ~ Exp(a), then E[Z] = a. We leave
the proof of Lemma 3 to Appendix A.2, since it follows from arguments very similar to those given
for Lemma 2. Practically, Lemma 3 tells us that under the global null hypothesis $y \sim N(0, \sigma^2 I)$,
comparing the covariance statistic $T_k$ at the kth step of the lasso path to an Exp(1) distribution
is increasingly conservative [at the first step, $T_1$ is asymptotically Exp(1), at the second step, $T_2$
is asymptotically Exp(1/2), at the third step, $T_3$ is asymptotically Exp(1/3), and so forth]. This
progressive conservatism is favorable, if we place importance on parsimony in the fitted model: we
are less and less likely to incur a false rejection of the null hypothesis as the size of the model grows.
Moreover, we know that the test statistics $T_1, T_2, \ldots$ at successive steps are asymptotically independent, and hence
so are the corresponding p-values; from the point of view of multiple testing corrections, this is
nearly an ideal scenario.
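
The progressive conservatism can be seen in a small simulation. The following sketch (added for illustration, not taken from the original; the function name, seed, and sample sizes are arbitrary choices) draws $U_1, \ldots U_p$ i.i.d. standard normal, which corresponds to the global null with orthogonal X and $\sigma = 1$, and compares the Monte Carlo means of $T_1, T_2, T_3$ with the limiting means 1, 1/2, 1/3.

```python
import heapq
import random


def simulate_T(p, k_max, reps, seed=0):
    """Monte Carlo means of T_1..T_{k_max} under the global null (orthogonal X, sigma = 1)."""
    rng = random.Random(seed)
    sums = [0.0] * k_max
    for _ in range(reps):
        # Only the k_max + 1 largest values of |U_j| are needed for T_1..T_{k_max}.
        top = heapq.nlargest(k_max + 1, (abs(rng.gauss(0.0, 1.0)) for _ in range(p)))
        for k in range(k_max):
            sums[k] += top[k] * (top[k] - top[k + 1])
    return [s / reps for s in sums]


means = simulate_T(p=500, k_max=3, reps=1000)
for k, m in enumerate(means, start=1):
    print(f"T_{k}: simulated mean {m:.3f}, limiting mean {1 / k:.3f}")
```

Because the convergence is in p and is slow, the simulated means at moderate p need not match the limits exactly; the sketch is meant only to show the decreasing scale of $T_k$ across steps.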

Of real interest is the distribution of $T_k$, $k > 1$ not under the global null, but rather under the weaker
null hypothesis that all variables left out of the current model are truly inactive variables (i.e., they
have zero coefficients in the true model). We study this in the next section.



3.2 A general step, k > 1 

We suppose that exactly $k_0$ components of the true coefficient vector $\beta^*$ are nonzero, and consider
testing the entry of the predictor at step $k = k_0 + 1$. Let $A^* = \mathrm{supp}(\beta^*)$ denote the true active set
(so $k_0 = |A^*|$), and let B denote the event that all truly active variables are added at steps $1, \ldots k_0$,

$$B = \Big\{ \min_{j \in A^*} |U_j| > \max_{j \notin A^*} |U_j| \Big\}. \qquad (11)$$

We show that under the null hypothesis (i.e., conditional on B), the test statistic $T_{k_0+1}$ is asymptotically
Exp(1), and further, the test statistic $T_{k_0+d}$ at a future step $k = k_0 + d$ is asymptotically
Exp(1/d).

The basic idea behind our argument is as follows: if we assume that the nonzero components
of $\beta^*$ are large enough in magnitude, then it is not hard to show (relying on orthogonality, here)
that the truly active predictors are added to the model along the first $k_0$ steps of the lasso path,
with probability tending to one. The test statistic at the $(k_0 + 1)$st step and beyond would therefore
depend on the order statistics of $|U_i|$ for truly inactive variables i, subject to the constraint that
the largest of these values is smaller than the smallest $|U_j|$ for truly active variables j. But with
our strong signal assumption, i.e., that the nonzero entries of $\beta^*$ are large in absolute value, this
constraint has essentially no effect, and we are back to studying the order statistics from a $\chi_1$
distribution, as in the last section. This is made precise below.

Theorem 1. Assume that $X \in \mathbb{R}^{n \times p}$ is orthogonal, and $y \in \mathbb{R}^n$ is drawn from the normal regression
model (1), where the true coefficient vector $\beta^*$ has $k_0$ nonzero components. Let $A^* = \mathrm{supp}(\beta^*)$ be the
true active set, and assume that the smallest nonzero true coefficient is large compared to $\sigma\sqrt{2 \log p}$,

$$\min_{j \in A^*} |\beta_j^*| - \sigma\sqrt{2 \log p} \to \infty \quad \text{as } p \to \infty.$$

Let B denote the event in (11), namely, that the first $k_0$ variables entering the model along the lasso
path are those in $A^*$. Then $P(B) \to 1$ as $p \to \infty$, and for each fixed $d > 0$, we have

$$(T_{k_0+1}, T_{k_0+2}, \ldots T_{k_0+d}) \overset{d}{\to} (\mathrm{Exp}(1), \mathrm{Exp}(1/2), \ldots \mathrm{Exp}(1/d)) \quad \text{as } p \to \infty.$$

The same convergence in distribution holds conditionally on B.

Proof. We first study $P(B)$. Let $\theta_p = \min_{j \in A^*} |\beta_j^*|$, and choose $c_p$ such that

$$c_p - \sigma\sqrt{2 \log p} \to \infty \quad \text{and} \quad \theta_p - c_p \to \infty.$$

Note that $U_j \sim N(\beta_j^*, \sigma^2)$, independently for $j = 1, \ldots p$. For $j \in A^*$,

$$P(|U_j| \le c_p) = \Phi\Big(\frac{c_p - \beta_j^*}{\sigma}\Big) - \Phi\Big(\frac{-c_p - \beta_j^*}{\sigma}\Big) \le \Phi\Big(\frac{c_p - \theta_p}{\sigma}\Big) \to 0,$$

so

$$P\Big(\min_{j \in A^*} |U_j| > c_p\Big) = \prod_{j \in A^*} P(|U_j| > c_p) \to 1.$$

At the same time,

$$P\Big(\max_{j \notin A^*} |U_j| \le c_p\Big) = \big(\Phi(c_p/\sigma) - \Phi(-c_p/\sigma)\big)^{p - k_0} \to 1.$$

Therefore $P(B) \to 1$. This in fact means that $P(E|B) - P(E) \to 0$ for any sequence of events E, so
only the weak convergence of $(T_{k_0+1}, \ldots T_{k_0+d})$ remains to be proved. For this, we let $m = p - k_0$,
and $V_1 \ge V_2 \ge \ldots \ge V_m$ denote the order statistics of the sample $|U_j|$, $j \notin A^*$ of independent $\chi_1$
variates. Then, on the event B, we have

$$T_{k_0+i} = V_i(V_i - V_{i+1}) \quad \text{for } i = 1, \ldots d.$$

As $P(B) \to 1$, we have in general

$$T_{k_0+i} = V_i(V_i - V_{i+1}) + o_P(1) \quad \text{for } i = 1, \ldots d.$$

Hence we are essentially back in the setting of the last section, and the desired convergence result
follows from the same arguments as those for Lemma 3. □
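
The two probability bounds in the proof combine, by independence of the $U_j$, into an explicit lower bound on $P(B)$. The sketch below (an added illustration; the function name, the cutoff offset 1.5, and the signal gaps are hypothetical choices, not values from the original) evaluates this bound for a given smallest signal $\theta$, cutoff $c$, noise level $\sigma$, dimension p, and number of active variables $k_0$.

```python
from math import log, sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF


def prob_B_lower_bound(theta, c, sigma, p, k0):
    """Lower bound on P(B): every active |U_j| exceeds c and every inactive |U_j| does not."""
    # P(|U_j| > c) >= 1 - Phi((c - theta) / sigma) for each of the k0 active variables
    active = (1.0 - PHI((c - theta) / sigma)) ** k0
    # P(|U_j| <= c) = Phi(c / sigma) - Phi(-c / sigma) for each of the p - k0 inactive variables
    inactive = (2.0 * PHI(c / sigma) - 1.0) ** (p - k0)
    return active * inactive


sigma, p, k0 = 1.0, 10_000, 5
threshold = sigma * sqrt(2 * log(p))     # the sigma * sqrt(2 log p) scale from Theorem 1
c = threshold + 1.5                       # hypothetical cutoff c_p above that scale
for gap in (0.0, 2.0, 4.0):               # gap = theta - c
    print(gap, round(prob_B_lower_bound(c + gap, c, sigma, p, k0), 4))
```

The bound approaches one only when both $c_p - \sigma\sqrt{2 \log p}$ and $\theta_p - c_p$ are large, which mirrors the two requirements placed on $c_p$ in the proof.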
4 A general predictor matrix X



In this section, we consider a general predictor matrix X, with columns in general position. Recall
that our proposed covariance test statistic is closely intertwined with the knots $\lambda_1 \ge \ldots \ge \lambda_r$
in the lasso path, as it was defined in terms of the difference between fitted values at successive knots.
Moreover, Lemma 1 showed that (provided there are no sign changes in the reduced lasso problem
over $[\lambda_{k+1}, \lambda_k]$) this test statistic can be expressed even more explicitly in terms of the values of
these knots. As was the case in the last section, this knot form is quite important for our analysis
here. Therefore, it is helpful to recall (Efron et al. 2004, Tibshirani 2012) the precise formulae for
the knots in the lasso path. If A denotes the active set and $s_A$ denotes the signs of active coefficients
at a knot $\lambda_k$,

$$A = \mathrm{supp}(\hat\beta(\lambda_k)), \quad s_A = \mathrm{sign}(\hat\beta_A(\lambda_k)),$$

then the next knot $\lambda_{k+1}$ is given by

$$\lambda_{k+1} = \max\{\lambda_{k+1}^{\mathrm{join}}, \lambda_{k+1}^{\mathrm{leave}}\}, \qquad (12)$$

where $\lambda_{k+1}^{\mathrm{join}}$ and $\lambda_{k+1}^{\mathrm{leave}}$ are the values of $\lambda$ at which, if we were to decrease the tuning parameter from
$\lambda_k$ and continue along the current (linear) trajectory for the lasso coefficients, a variable would join
and leave the active set A, respectively. These values are⁴

$$\lambda_{k+1}^{\mathrm{join}} = \max_{j \notin A, \, s \in \{-1,1\}} \frac{X_j^T(I - P_A)y}{s - X_j^T(X_A^T)^+ s_A} \cdot 1\left\{ \frac{X_j^T(I - P_A)y}{s - X_j^T(X_A^T)^+ s_A} \le \lambda_k \right\}, \qquad (13)$$

where $P_A$ is the projection onto the column space of $X_A$, $P_A = X_A(X_A^T X_A)^{-1}X_A^T$, and $(X_A)^+$ is
the pseudoinverse $(X_A)^+ = (X_A^T X_A)^{-1}X_A^T$ [so that $(X_A^T)^+$ is its transpose]; and

$$\lambda_{k+1}^{\mathrm{leave}} = \max_{j \in A} \frac{[(X_A)^+ y]_j}{[(X_A^T X_A)^{-1} s_A]_j} \cdot 1\left\{ \frac{[(X_A)^+ y]_j}{[(X_A^T X_A)^{-1} s_A]_j} \le \lambda_k \right\}. \qquad (14)$$

As we did in Section 3 with the orthogonal X case, we begin by studying the asymptotic distribution
of the covariance statistic in the special case k = 1 (i.e., the first model along the path),
wherein the expressions for the next knot (12), (13), (14) greatly simplify. Following this, we study
the more difficult case k > 1. For the sake of readability we defer the proofs and most technical
details until the appendix.

4.1 The first step, k = 1

We assume here that X has unit norm columns: $\|X_i\|_2 = 1$, for $i = 1, \ldots p$; we do this mostly for
simplicity of presentation, and the generalization to a matrix X whose columns are not unit normed
is given in the next section (though there the exponential limit is a conservative upper bound). As
per our discussion following Lemma 1 (see the second point), we know that the first predictor to
enter the active set along the lasso path cannot leave at the next step, so the constant sign condition
of Lemma 1 holds, and by that lemma the covariance statistic for testing the entry of the first variable
can be written as

$$T_1 = \lambda_1(\lambda_1 - \lambda_2)/\sigma^2$$

(the leading factor C being equal to one since we assumed that X has unit norm columns). Now let
$U_j = X_j^T y$, $j = 1, \ldots p$, and $R = X^T X$. With $\lambda_0 = \infty$, we have $A = \emptyset$, and trivially, no variables
can leave the active set. The first knot is hence given by (13), which can be expressed as

$$\lambda_1 = \max_{j = 1, \ldots p, \, s \in \{-1,1\}} sU_j. \qquad (15)$$



⁴ In expressing the joining and leaving times in the forms (13) and (14), we are implicitly assuming that $\lambda_{k+1} < \lambda_k$,
with strict inequality. Since X has columns in general position, this is true for (Lebesgue) almost every y, or in other
words, with probability one taken over the normally distributed errors in (1).
Letting $j_1, s_1$ be the first variable to enter and its sign (i.e., they achieve the maximum in the above
expression), and recalling that $j_1$ cannot leave the active set immediately after it has entered, the
second knot is again given by (13), written as

$$\lambda_2 = \max_{j \ne j_1, \, s \in \{-1,1\}} \frac{sU_j - sR_{j,j_1}U_{j_1}}{1 - ss_1R_{j,j_1}} \cdot 1\left\{ \frac{sU_j - sR_{j,j_1}U_{j_1}}{1 - ss_1R_{j,j_1}} \le s_1U_{j_1} \right\}.$$

The general position assumption on X implies that $|R_{j,j_1}| < 1$, and so $1 - ss_1R_{j,j_1} > 0$, all $j \ne j_1$,
$s \in \{-1, 1\}$. It is easy to show then that the indicator inside the maximum above can be dropped,
and hence

$$\lambda_2 = \max_{j \ne j_1, \, s \in \{-1,1\}} \frac{sU_j - sR_{j,j_1}U_{j_1}}{1 - ss_1R_{j,j_1}}. \qquad (16)$$
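
Expressions (15) and (16) translate directly into a few lines of code. The sketch below is an added illustration (the function names and the two small examples are hypothetical); it assumes that X has unit norm columns in general position, and takes as inputs the inner products $U_j = X_j^T y$ and the Gram matrix $R = X^T X$.

```python
def first_two_knots(U, R):
    """Return (lambda_1, lambda_2, j_1, s_1) from (15) and (16)."""
    p = len(U)
    # (15): maximize s * U_j over j and s in {-1, 1}
    lam1, j1, s1 = max((s * U[j], j, s) for j in range(p) for s in (-1, 1))
    # (16): maximize over j != j_1 and s; each denominator is positive in general position
    lam2 = max((s * U[j] - s * R[j][j1] * U[j1]) / (1 - s * s1 * R[j][j1])
               for j in range(p) if j != j1 for s in (-1, 1))
    return lam1, lam2, j1, s1


def T1(U, R, sigma=1.0):
    """Covariance statistic T_1 = lambda_1 (lambda_1 - lambda_2) / sigma^2."""
    lam1, lam2, _, _ = first_two_knots(U, R)
    return lam1 * (lam1 - lam2) / sigma**2


# Orthogonal case: R = I, so lambda_2 is the second largest |U_j|.
U_orth = [3.0, -2.0, 1.0]
R_orth = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(first_two_knots(U_orth, R_orth))   # (3.0, 2.0, 0, 1)

# Two correlated predictors with correlation 0.5.
U_corr = [2.0, 1.5]
R_corr = [[1.0, 0.5], [0.5, 1.0]]
print(T1(U_corr, R_corr))                # 2.0
```

In the correlated example the second variable enters at $\lambda_2 = 1$ rather than at $|U_2| = 1.5$, because part of its inner product with y is already explained by the first variable.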

Our goal now is to calculate the asymptotic distribution of $T_1 = \lambda_1(\lambda_1 - \lambda_2)/\sigma^2$, with $\lambda_1$ and $\lambda_2$ as
above, under the null hypothesis; to be clear, since we are testing the significance of the first variable
to enter along the lasso path, the null hypothesis is

$$H_0 : y \sim N(0, \sigma^2 I). \qquad (17)$$

The strategy that we use here for the general X case (which differs from our extreme value theory
approach for the orthogonal X case) is to treat the quantities inside the maxima in expressions
(15), (16) for $\lambda_1, \lambda_2$ as discrete-time Gaussian processes. First, we consider the zero mean Gaussian
process

$$g(j, s) = sU_j \quad \text{for } j = 1, \ldots p, \; s \in \{-1, 1\}. \qquad (18)$$

We can easily compute the covariance function of this process:

$$E[g(j,s)g(j',s')] = ss'R_{j,j'}\sigma^2,$$

where the expectation is taken over the null distribution in (17). From (15), we know that the first
knot is simply

$$\lambda_1 = \max_{j,s} g(j,s).$$

In addition to (18), we consider the process

$$h^{(j_1,s_1)}(j,s) = \frac{g(j,s) - ss_1R_{j,j_1}g(j_1,s_1)}{1 - ss_1R_{j,j_1}} \quad \text{for } j \ne j_1, \; s \in \{-1,1\}. \qquad (19)$$

An important property: for fixed $j_1, s_1$, the entire process $h^{(j_1,s_1)}(j,s)$ is independent of $g(j_1,s_1)$.
This can be seen by verifying that

$$E[g(j_1,s_1)h^{(j_1,s_1)}(j,s)] = 0,$$

and noting that $g(j_1,s_1)$ and $h^{(j_1,s_1)}(j,s)$, all $j \ne j_1$, $s \in \{-1,1\}$, are jointly normal. Now define

$$M(j_1,s_1) = \max_{j \ne j_1, \, s} h^{(j_1,s_1)}(j,s), \qquad (20)$$

and from the above we know that for fixed $j_1, s_1$, $M(j_1,s_1)$ is independent of $g(j_1,s_1)$. If $j_1, s_1$ are
instead treated as random variables that maximize $g(j,s)$ (the argument maximizers being almost
surely unique), then from (16) we see that the second knot is $\lambda_2 = M(j_1,s_1)$. Therefore, to study
the distribution of $T_1 = \lambda_1(\lambda_1 - \lambda_2)/\sigma^2$, we are interested in the random variable

$$g(j_1,s_1)\big(g(j_1,s_1) - M(j_1,s_1)\big)/\sigma^2$$

on the event

$$\{g(j_1,s_1) > g(j,s) \text{ for all } j,s\}.$$

It turns out that this event, which concerns the argument maximizers of g, can be rewritten as an
event concerning only the relative values of g and M [see Taylor et al. (2005) for the analogous result
for continuous-time processes].
Lemma 4. With g, M as defined in (18), (19), (20), we have

$$\{g(j_1,s_1) > g(j,s) \text{ for all } j,s\} = \{g(j_1,s_1) > M(j_1,s_1)\}.$$

This is an important realization because the dual representation $\{g(j_1,s_1) > M(j_1,s_1)\}$ is more
tractable, once we partition the space over the possible argument maximizers $j_1, s_1$, and use the fact
that $M(j_1,s_1)$ is independent of $g(j_1,s_1)$ for fixed $j_1, s_1$. In this vein, we express the distribution of
$T_1 = \lambda_1(\lambda_1 - \lambda_2)/\sigma^2$ in terms of the sum

$$P(T_1 > t) = \sum_{j_1,s_1} P\Big( g(j_1,s_1)\big(g(j_1,s_1) - M(j_1,s_1)\big)/\sigma^2 > t, \; g(j_1,s_1) > M(j_1,s_1) \Big).$$

The terms in the above sum can be simplified: dropping for notational convenience the dependence
on $j_1, s_1$, we have

$$g(g - M)/\sigma^2 > t, \; g > M \quad \Longleftrightarrow \quad g/\sigma > u(t, M/\sigma),$$

where $u(a,b) = (b + \sqrt{b^2 + 4a})/2$, which follows by simply solving for g in the quadratic equation
$g(g - M)/\sigma^2 = t$. Therefore

$$P(T_1 > t) = \sum_{j_1,s_1} P\Big( g(j_1,s_1)/\sigma > u\big(t, M(j_1,s_1)/\sigma\big) \Big) = \sum_{j_1,s_1} \int_0^\infty \bar\Phi\big(u(t, m/\sigma)\big) F_{M(j_1,s_1)}(dm), \qquad (21)$$

where $\bar\Phi$ is the standard normal survival function (i.e., $\bar\Phi = 1 - \Phi$, for $\Phi$ the standard normal CDF),
$F_{M(j_1,s_1)}$ is the distribution of $M(j_1,s_1)$, and we have used the fact that $g(j_1,s_1)$ and $M(j_1,s_1)$ are
independent for fixed $j_1, s_1$, and also $M(j_1,s_1) \ge 0$ on the event $\{g(j_1,s_1) > M(j_1,s_1)\}$. (The latter
follows as Lemma 4 shows this event to be equivalent to $j_1, s_1$ being the argument maximizers of g,
which means that $M(j_1,s_1) = \lambda_2 \ge 0$.) Continuing from (21), we can write the difference between
$P(T_1 > t)$ and the standard exponential tail, $P(\mathrm{Exp}(1) > t) = e^{-t}$, as

$$\big| P(T_1 > t) - e^{-t} \big| = \left| \sum_{j_1,s_1} \int_0^\infty \left\{ \frac{\bar\Phi\big(u(t, m/\sigma)\big)}{\bar\Phi(m/\sigma)} - e^{-t} \right\} \bar\Phi(m/\sigma) F_{M(j_1,s_1)}(dm) \right|, \qquad (22)$$

where we used the fact that

$$\sum_{j_1,s_1} \int_0^\infty \bar\Phi(m/\sigma) F_{M(j_1,s_1)}(dm) = \sum_{j_1,s_1} P\big(g(j_1,s_1) > M(j_1,s_1)\big) = 1.$$

We now examine the term inside the braces in (22), the difference between a ratio of normal survival
functions and $e^{-t}$; our next lemma shows that this term vanishes as $m \to \infty$.

Lemma 5. For any $t > 0$,

$$\frac{\bar\Phi\big(u(t, m)\big)}{\bar\Phi(m)} \to e^{-t} \quad \text{as } m \to \infty.$$
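
Lemma 5 is easy to examine numerically. The sketch below (an added illustration; the helper names `u` and `survival` are our own) implements $u(a,b)$ as the positive root of $g(g - b) = a$ and evaluates the ratio $\bar\Phi(u(t,m))/\bar\Phi(m)$ for increasing m. The survival function is computed through the complementary error function, since forming $1 - \Phi(x)$ directly loses accuracy in the far tail.

```python
from math import erfc, exp, sqrt


def u(a, b):
    """Positive root in g of g * (g - b) = a, i.e. (b + sqrt(b^2 + 4a)) / 2."""
    return (b + sqrt(b * b + 4.0 * a)) / 2.0


def survival(x):
    """Standard normal survival function 1 - Phi(x), computed via erfc."""
    return 0.5 * erfc(x / sqrt(2.0))


t = 1.0
for m in (1.0, 5.0, 10.0, 20.0):
    ratio = survival(u(t, m)) / survival(m)
    print(m, ratio, exp(-t))   # the ratio should approach exp(-t) as m grows
```

Since $u(t,m) \approx m + t/m$ for large m, the ratio behaves like $e^{-t}$ up to a relative error of order $1/m^2$, which suggests why the argument needs each $M(j_1,s_1)$ to be large with high probability.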



Hence, loosely speaking, if each $M(j_1,s_1) \to \infty$ fast enough as $p \to \infty$, then the right-hand side in (22)
converges to zero, and $T_1$ converges weakly to Exp(1). This is made precise below.



Lemma 6. Consider $M(j_1,s_1)$ defined in (19), (20) over $j_1 = 1, \ldots p$ and $s_1 \in \{-1, 1\}$. If for any
fixed $m_0 > 0$

$$\sum_{j_1,s_1} P\big(M(j_1,s_1) \le m_0\big) \to 0 \quad \text{as } p \to \infty, \qquad (23)$$

then the right-hand side in (22) converges to zero as $p \to \infty$, and so $P(T_1 > t) \to e^{-t}$ for all $t > 0$.

The assumption in (23) is written in terms of random variables whose distributions are induced
by the steps along the lasso path; to make our assumptions more transparent, we show that (23) is
implied by a conditional variance bound involving the predictor matrix X alone, and arrive at the
main result of this section.

Theorem 2. Assume that $X \in \mathbb{R}^{n \times p}$ has unit norm columns in general position, and let $R = X^T X$.
Assume also that there is some $\delta > 0$ such that for each $j = 1, \ldots p$, there exists a subset of indices
$S \subseteq \{1, \ldots p\} \setminus \{j\}$ with

$$1 - R_{i,S \setminus \{i\}} \big(R_{S \setminus \{i\}, S \setminus \{i\}}\big)^{-1} R_{S \setminus \{i\}, i} \ge \delta^2 \quad \text{for all } i \in S, \qquad (24)$$

and the size of S growing faster than $\log p$,

$$|S| \ge d_p, \quad \text{where } \frac{d_p}{\log p} \to \infty \text{ as } p \to \infty. \qquad (25)$$

Then under the null distribution in (17) [i.e., y is drawn from the regression model (1) with $\beta^* = 0$],
we have $P(T_1 > t) \to e^{-t}$ as $p \to \infty$ for all $t > 0$.

Remark. Conditions (24) and (25) are sufficient to ensure (23), or in other words, that each $M(j_1,s_1)$
grows as in $P(M(j_1,s_1) \le m_0) = o(1/p)$, for any fixed $m_0$. While it is true that $E[M(j_1,s_1)]$ will
typically grow as p grows, some assumption is required so that $M(j_1,s_1)$ concentrates around its
mean faster than standard Gaussian concentration results (such as the Borell-TIS inequality) imply.

Generally speaking, the assumptions (24) and (25) are not very strong. Stated differently, (24) is
a lower bound on the variance of $U_i = X_i^T y$, conditional on $U_\ell = X_\ell^T y$ for all $\ell \in S \setminus \{i\}$. Hence for
any j, we require the existence of a subset S not containing j such that the variables $U_i$, $i \in S$ are
not too correlated, in the sense that the conditional variance of any one on all the others is bounded
below. This subset S has to be larger in size than $\log p$, as made clear in (25). Note that, in fact, it
suffices to find a total of two disjoint subsets $S_1, S_2$ with the properties (24) and (25), because then
for any j, either one or the other will not contain j.
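
For a given Gram matrix R and candidate subset S, the left-hand side of (24) can be computed by solving one small linear system per index. The sketch below is an added illustration in plain Python (the helper names and the equicorrelated example are hypothetical, and the dense solver is intended only for small subsets); it returns, for each $i \in S$, the conditional variance of $U_i$ given the other $U_\ell$, $\ell \in S$, in units of $\sigma^2$.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x


def conditional_variances(R, S):
    """Left-hand side of (24) for each i in S, assuming unit diagonal in R."""
    out = {}
    for i in S:
        rest = [l for l in S if l != i]
        if not rest:
            out[i] = 1.0               # nothing to condition on
            continue
        sub = [[R[a][b] for b in rest] for a in rest]
        r_i = [R[l][i] for l in rest]
        coef = solve(sub, r_i)         # (R_{rest,rest})^{-1} R_{rest,i}
        out[i] = 1.0 - sum(c * r for c, r in zip(coef, r_i))
    return out


rho = 0.5   # hypothetical common correlation
R = [[1.0 if i == j else rho for j in range(3)] for i in range(3)]
cond_var = conditional_variances(R, S=[0, 1, 2])
print(cond_var)   # each value is 2/3 up to rounding, so (24) holds with delta^2 = 2/3
```

Condition (24) then amounts to checking that the smallest returned value is at least $\delta^2$; the size requirement (25) has to be verified separately.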

An example of a matrix X that does not satisfy (24) and (25) is one with fixed rank as p grows.
(This, of course, would also not satisfy the general position assumption.) In this case, we would not
be able to find a subset of the variables $U_i = X_i^T y$, $i = 1, \ldots p$ that is both linearly independent
and has size larger than $r = \mathrm{rank}(X)$, which violates the conditions. We note that in general, since
$|S| \le \mathrm{rank}(X) \le n$, and $|S|/\log p \to \infty$, conditions (24) and (25) require that $n/\log p \to \infty$.

4.2 A general step, k > 1 

In this section, we no longer assume that X has unit norm columns (in any case, this provides no 
simplification in deriving the null distribution of the test statistic at a general step in the lasso path). 
Our arguments here have more or less the same form as they did in the last section, but overall the 
calculations are more complicated. 

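Fix an integer $k_0 \ge 0$, subset $A_0 \subseteq \{1, \ldots p\}$ containing the true active set $A_0 \supseteq A^* = \mathrm{supp}(\beta^*)$,
and sign vector $s_{A_0} \in \{-1, 1\}^{|A_0|}$. Consider the event

$$\begin{aligned} B = \Big\{ & \text{The solution at step } k_0 \text{ in the lasso path has active set } A = A_0, \\ & \text{signs } s_A = \mathrm{sign}((X_{A_0})^+ y) = s_{A_0}, \text{ and the next two knots are given by} \\ & \lambda_{k_0+1} = \max_{j \notin A_0, \, s \in \{-1,1\}} \frac{X_j^T(I - P_{A_0})y}{s - X_j^T(X_{A_0}^T)^+ s_{A_0}}, \quad \lambda_{k_0+2} = \lambda_{k_0+2}^{\mathrm{join}} \Big\}. \end{aligned} \qquad (26)$$
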
We assume that $P(B) \to 1$ as $p \to \infty$. In words, this is assuming that with probability approaching
one: the lasso estimate at step $k_0$ in the path has support $A_0$ and signs $s_{A_0}$; the least squares estimate
on $A_0$ has the same signs as this lasso estimate; the knots at steps $k_0 + 1$ and $k_0 + 2$ correspond
to joining events; and in particular, the maximization defining the joining event at step $k_0 + 1$ can
be taken to be unrestricted, i.e., without the indicators constraining the individual arguments to be
$\le \lambda_{k_0}$. Our goal is to characterize the asymptotic distribution of the covariance statistic $T_k$ at the
step $k = k_0 + 1$, under the null hypothesis (i.e., conditional on the event B). We will comment on
the stringency of the assumption that $P(B) \to 1$ following our main result in Theorem 3.

First note that on B, we have $s_A = \mathrm{sign}((X_A)^+ y)$, and as discussed in the third point following
Lemma 1, this implies that the solution of the reduced lasso problem on $X_A$ cannot incur any sign
changes over the interval $[\lambda_{k+1}, \lambda_k]$. Hence we can apply Lemma 1 to write the covariance statistic
on B as

$$T_k = C(A, s_A, j_k, s_k) \cdot \lambda_k(\lambda_k - \lambda_{k+1})/\sigma^2,$$

where $C(A, s_A, j_k, s_k) = \|(X_{A \cup \{j_k\}}^T)^+ s_{A \cup \{j_k\}} - (X_A^T)^+ s_A\|_2^2$, A and $s_A$ are the active set and signs
at step $k - 1$, and $j_k$ is the variable added to the active set at step k, with sign $s_k$. Now, analogous
to our definition in the last section, we define the discrete-time Gaussian process

$$g^{(A,s_A)}(j,s) = \frac{X_j^T(I - P_A)y}{s - X_j^T(X_A^T)^+ s_A} \quad \text{for } j \notin A, \; s \in \{-1,1\}. \qquad (27)$$



For any fixed A, $s_A$, the above process has mean zero provided that $A \supseteq A^*$. Additionally, for any
such fixed A, $s_A$, we can compute its covariance function

$$E\big[g^{(A,s_A)}(j,s) \, g^{(A,s_A)}(j',s')\big] = \frac{X_j^T(I - P_A)X_{j'} \, \sigma^2}{[s - X_j^T(X_A^T)^+ s_A][s' - X_{j'}^T(X_A^T)^+ s_A]}. \qquad (28)$$

Note that on the event B, the kth knot in the lasso path is

$$\lambda_k = \max_{j \notin A, \, s \in \{-1,1\}} g^{(A,s_A)}(j,s).$$

For fixed $j_k, s_k$, we also consider the process

$$g^{(A \cup \{j_k\}, s_{A \cup \{j_k\}})}(j,s) = \frac{X_j^T(I - P_{A \cup \{j_k\}})y}{s - X_j^T(X_{A \cup \{j_k\}}^T)^+ s_{A \cup \{j_k\}}} \quad \text{for } j \notin A \cup \{j_k\}, \; s \in \{-1,1\} \qquad (29)$$

(above, $s_{A \cup \{j_k\}}$ is the concatenation of $s_A$ and $s_k$) and its achieved maximum value, subject to being
less than the maximum of $g^{(A,s_A)}$,

$$M^{(A,s_A)}(j_k,s_k) = \max_{j \notin A \cup \{j_k\}, \, s \in \{-1,1\}} g^{(A \cup \{j_k\}, s_{A \cup \{j_k\}})}(j,s) \cdot 1\Big\{ g^{(A \cup \{j_k\}, s_{A \cup \{j_k\}})}(j,s) \le \max_{j' \notin A, \, s' \in \{-1,1\}} g^{(A,s_A)}(j',s') \Big\}. \qquad (30)$$

If $j_k, s_k$ indeed maximize $g^{(A,s_A)}$, i.e., they correspond to the variable added to the active set at $\lambda_k$
and its sign (note that these are almost surely unique), then on B, we have $\lambda_{k+1} = M^{(A,s_A)}(j_k,s_k)$. To
study the distribution of $T_k$ on B, we are therefore interested in the random variable

$$C(A, s_A, j_k, s_k) \cdot g^{(A,s_A)}(j_k,s_k)\big(g^{(A,s_A)}(j_k,s_k) - M^{(A,s_A)}(j_k,s_k)\big)/\sigma^2$$

on the event

$$E(j_k,s_k) = \big\{ g^{(A,s_A)}(j_k,s_k) > g^{(A,s_A)}(j,s) \text{ for all } j,s \big\}. \qquad (31)$$

Equivalently, we may write

$$P(\{T_k > t\} \cap B) = \sum_{j_k,s_k} P\Big( \big\{ C(A, s_A, j_k, s_k) \cdot g^{(A,s_A)}(j_k,s_k)\big(g^{(A,s_A)}(j_k,s_k) - M^{(A,s_A)}(j_k,s_k)\big)/\sigma^2 > t \big\} \cap E(j_k,s_k) \cap B \Big).$$

Since $P(B) \to 1$, we have in general

$$P(T_k > t) = \sum_{j_k,s_k} P\Big( \big\{ C(A_0, s_{A_0}, j_k, s_k) \cdot g^{(A_0,s_{A_0})}(j_k,s_k)\big(g^{(A_0,s_{A_0})}(j_k,s_k) - M^{(A_0,s_{A_0})}(j_k,s_k)\big)/\sigma^2 > t \big\} \cap E(j_k,s_k) \Big) + o(1), \qquad (32)$$

where we have replaced all instances of A and $s_A$ on the right-hand side above with the fixed subset
$A_0$ and sign vector $s_{A_0}$. This is a helpful simplification, because in what follows we may now take
$A = A_0$ and $s_A = s_{A_0}$ as fixed, and consider the distribution of the random processes $g^{(A_0,s_{A_0})}$ and
$M^{(A_0,s_{A_0})}$. With $A = A_0$ and $s_A = s_{A_0}$ fixed, we drop the notational dependence on them and write
these processes as g and M. We also write the scaling factor $C(A_0, s_{A_0}, j_k, s_k)$ as $C(j_k, s_k)$.

The setup in (32) looks very much like the one in the last section [and to draw an even sharper
parallel, the scaling factor $C(j_k, s_k)$ is actually equal to one over the variance of $g(j_k, s_k)$, meaning
that $\sqrt{C(j_k, s_k)} \cdot g(j_k, s_k)$ is standard normal for fixed $j_k, s_k$, a fact that we will use later in the
proof of Lemma 8]. However, a major complication is that $g(j_k, s_k)$ and $M(j_k, s_k)$ are no longer
independent for fixed $j_k, s_k$. Next, we derive a dual representation for the event (31) (analogous to
Lemma 4 in the last section), introducing a triplet of random variables $M^+, M^-, M^0$; it turns out
that g is independent of this triplet, for fixed $j_k, s_k$.

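Lemma 7. Let g be as defined in (27) (with A, $s_A$ fixed at $A_0, s_{A_0}$). Let $\Sigma_{j,j'}$ denote the covariance
function of g [short form for the expression in (28)].⁵ Define

$$S^+(j,s) = \Big\{ (j',s') : \frac{\Sigma_{j,j'}}{\Sigma_{jj}} < 1 \Big\}, \quad M^+(j,s) = \max_{(j',s') \in S^+(j,s)} \frac{g(j',s') - (\Sigma_{j,j'}/\Sigma_{jj}) \, g(j,s)}{1 - \Sigma_{j,j'}/\Sigma_{jj}}, \qquad (33)$$

$$S^-(j,s) = \Big\{ (j',s') : \frac{\Sigma_{j,j'}}{\Sigma_{jj}} > 1 \Big\}, \quad M^-(j,s) = \min_{(j',s') \in S^-(j,s)} \frac{g(j',s') - (\Sigma_{j,j'}/\Sigma_{jj}) \, g(j,s)}{1 - \Sigma_{j,j'}/\Sigma_{jj}}, \qquad (34)$$

$$S^0(j,s) = \Big\{ (j',s') : \frac{\Sigma_{j,j'}}{\Sigma_{jj}} = 1 \Big\}, \quad M^0(j,s) = \max_{(j',s') \in S^0(j,s)} g(j',s') - (\Sigma_{j,j'}/\Sigma_{jj}) \, g(j,s). \qquad (35)$$

Then the event $E(j_k, s_k)$ in (31), that $j_k, s_k$ maximize g, can be written as an intersection of events
involving $M^+, M^-, M^0$:

$$\big\{ g(j_k,s_k) > g(j,s) \text{ for all } j,s \big\} = \big\{ g(j_k,s_k) > M^+(j_k,s_k) \big\} \cap \big\{ g(j_k,s_k) < M^-(j_k,s_k) \big\} \cap \big\{ 0 > M^0(j_k,s_k) \big\}. \qquad (36)$$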
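
The identity (36) is purely algebraic: for each competitor $(j',s')$, the inequality $g(j_k,s_k) > g(j',s')$ is rearranged according to whether the ratio $\Sigma_{j_k,j'}/\Sigma_{j_k j_k}$ is below, above, or equal to one. The sketch below is an added illustration that checks this on randomly generated values; the function name, the grid of ratios, and the conventions that a maximum over an empty set is $-\infty$ and a minimum over an empty set is $+\infty$ are our own assumptions, and the candidate $(j_k,s_k)$ itself is excluded from the comparison.

```python
import random


def dual_event(g, rho, j):
    """Evaluate both sides of (36) for index j.

    g[l] plays the role of g(j', s') and rho[l] the role of Sigma_{j,j'} / Sigma_{jj};
    the index j itself is excluded from all three sets.
    """
    others = [l for l in range(len(g)) if l != j]
    direct = all(g[j] > g[l] for l in others)
    resid = {l: g[l] - rho[l] * g[j] for l in others}
    m_plus = max((resid[l] / (1 - rho[l]) for l in others if rho[l] < 1),
                 default=float("-inf"))
    m_minus = min((resid[l] / (1 - rho[l]) for l in others if rho[l] > 1),
                  default=float("inf"))
    m_zero = max((resid[l] for l in others if rho[l] == 1),
                 default=float("-inf"))
    dual = (g[j] > m_plus) and (g[j] < m_minus) and (0 > m_zero)
    return direct, dual


rng = random.Random(0)
ratios = [-0.8, -0.2, 0.0, 0.4, 0.9, 1.0, 1.5, 3.0]   # hypothetical covariance ratios
mismatches = 0
for _ in range(2000):
    g = [rng.gauss(0.0, 1.0) for _ in range(5)]
    rho = [rng.choice(ratios) for _ in range(5)]
    direct, dual = dual_event(g, rho, j=0)
    mismatches += (direct != dual)
print(mismatches)   # 0: the two descriptions of the event agree on every draw
```

The value of the representation in the text is probabilistic rather than algebraic: for fixed $j_k, s_k$, the quantities $M^+, M^-, M^0$ are built from residuals that are uncorrelated with $g(j_k,s_k)$.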

As a result of Lemma 7, continuing from (32), we can decompose the tail probability of $T_k$ as

$$P(T_k > t) = \sum_{j_k,s_k} P\Big( C(j_k,s_k) \cdot g(j_k,s_k)\big(g(j_k,s_k) - M(j_k,s_k)\big)/\sigma^2 > t, \; g(j_k,s_k) > M^+(j_k,s_k), \; g(j_k,s_k) < M^-(j_k,s_k), \; 0 > M^0(j_k,s_k) \Big) + o(1). \qquad (37)$$

⁵ To be perfectly clear, here $\Sigma_{j,j'}$ actually depends on $s, s'$, but our notation suppresses this dependence for brevity.
A key point here is that, for fixed $j_k, s_k$, the triplet $M^+(j_k,s_k), M^-(j_k,s_k), M^0(j_k,s_k)$ is independent
of $g(j_k,s_k)$, which is true because

$$E\Big[ g(j_k,s_k) \Big( g(j,s) - \frac{\Sigma_{j_k,j}}{\Sigma_{j_k,j_k}} \, g(j_k,s_k) \Big) \Big] = 0,$$

and $g(j_k,s_k)$, along with $g(j,s) - (\Sigma_{j_k,j}/\Sigma_{j_k,j_k}) \, g(j_k,s_k)$, for all j, s, form a jointly Gaussian collection
of random variables. If we were to now replace M by $M^+$ in the first line of (37), and define a
modified statistic $\tilde{T}_k$ via its tail probability,

$$P(\tilde{T}_k > t) = \sum_{j_k,s_k} P\Big( C(j_k,s_k) \cdot g(j_k,s_k)\big(g(j_k,s_k) - M^+(j_k,s_k)\big)/\sigma^2 > t, \; g(j_k,s_k) > M^+(j_k,s_k), \; g(j_k,s_k) < M^-(j_k,s_k), \; 0 > M^0(j_k,s_k) \Big), \qquad (38)$$

then arguments similar to those in the second half of Section 4.1 give a (conservative) exponential
limit for $P(\tilde{T}_k > t)$.

Lemma 8. Consider g as defined in (27) (with A, $s_A$ fixed at $A_0, s_{A_0}$), and $M^+, M^-, M^0$ as defined
in (33), (34), (35). Assume that for any fixed $m_0$,

$$\sum_{j_k,s_k} P\Big( M^+(j_k,s_k) \le m_0 / \sqrt{C(j_k,s_k)} \Big) \to 0 \quad \text{as } p \to \infty. \qquad (39)$$

Then the modified statistic $\tilde{T}_k$ in (38) satisfies $\lim_{p \to \infty} P(\tilde{T}_k > t) \le e^{-t}$, for all $t > 0$.

Of course, deriving the limiting distribution of $\tilde{T}_k$ was not the goal, and it remains to relate
$P(\tilde{T}_k > t)$ to $P(T_k > t)$. A fortuitous calculation shows that the two seemingly different quantities
$M^+$ and M (the former of which is defined as the maximum of particular functionals of g, and the
latter concerned with the joining event at step k + 1) admit a very simple relationship: $M^+(j_k,s_k) \le
M(j_k,s_k)$ for the maximizing $j_k, s_k$. We use this to bound the tail of $T_k$.

Lemma 9. Consider g, M as defined in (27), (29), (30) (with A, $s_A$ fixed at $A_0, s_{A_0}$), and consider
$M^+$ as defined in (33). Then for any fixed $j_k, s_k$, on the event $E(j_k,s_k)$ in (31), we have

$$M^+(j_k,s_k) \le M(j_k,s_k).$$

Hence if we assume as in Lemma 8 the condition (39), then $\lim_{p \to \infty} P(T_k > t) \le e^{-t}$ for all $t > 0$.

Though Lemma 9 establishes a (conservative) exponential limit for the covariance statistic $T_k$, it
does so by enforcing assumption (39), which is phrased in terms of the tail distribution of a random
process defined at the kth step in the lasso path. We translate this into an explicit condition on
the covariance structure in (28), to make the stated assumptions for exponential convergence more
concrete.

Theorem 3. Assume that $X \in \mathbb{R}^{n \times p}$ has columns in general position, and $y \in \mathbb{R}^n$ is drawn from
the normal regression model (1). Assume that for a fixed integer $k_0 \ge 0$, subset $A_0 \subseteq \{1, \ldots p\}$ with
$A_0 \supseteq A^* = \mathrm{supp}(\beta^*)$, and sign vector $s_{A_0} \in \{-1, 1\}^{|A_0|}$, the event B in (26) satisfies $P(B) \to 1$ as
$p \to \infty$. Assume that there exists a constant $0 < \eta \le 1$ such that

$$\|(X_{A_0})^+ X_j\|_1 \le 1 - \eta \quad \text{for all } j \notin A_0. \qquad (40)$$

Define the matrix R by

$$R_{ij} = X_i^T(I - P_{A_0})X_j, \quad \text{for } i, j \notin A_0.$$

Assume that the diagonal elements in R are all of the same order, i.e., $R_{ii}/R_{jj} \le C$ for all i, j and
some constant C > 0. Finally assume that, for each fixed $j \notin A_0$, there is a set $S \subseteq \{1, \ldots p\} \setminus (A_0 \cup
\{j\})$ such that for all $i \in S$,

$$\big[ R_{ii} - R_{i,S \setminus \{i\}} \big(R_{S \setminus \{i\}, S \setminus \{i\}}\big)^{-1} R_{S \setminus \{i\}, i} \big] / R_{ii} \ge \delta^2, \qquad (41)$$

$$|R_{ij}| / R_{jj} < \eta/(2 - \eta), \qquad (42)$$

$$\|(X_{A_0 \cup \{j\}})^+ X_i\|_1 < 1, \qquad (43)$$

where $\delta > 0$ is a constant (not depending on j), and the size of S grows faster than $\log p$,

$$|S| \ge d_p, \quad \text{where } \frac{d_p}{\log p} \to \infty \text{ as } p \to \infty. \qquad (44)$$

Then at step $k = k_0 + 1$, we have $\lim_{p \to \infty} P(T_k > t) \le e^{-t}$ for all $t > 0$. The same result holds for
the tail of $T_k$ conditional on B.

Remark 1. If X has unit norm columns, then by taking $k_0 = 0$ (and accordingly, $A_0 = \emptyset$, $s_{A_0} = \emptyset$)
in Theorem 3, we essentially recover the result of Theorem 2. To see this, note that with $k_0 = 0$
(and $A_0, s_{A_0} = \emptyset$), we have $P(B) = 1$ for all finite p (recall the arguments given at the beginning of
Section 4.1). Also, condition (40) trivially holds with $\eta = 1$ because $A_0 = \emptyset$. Next, the matrix R
defined in the theorem reduces to $R = X^T X$, again because $A_0 = \emptyset$; note that R has all diagonal
elements equal to one, because X has unit norm columns. Hence (41) is the same as condition (24)
in Theorem 2. Finally, conditions (42) and (43) both reduce to $|R_{ij}| < 1$, which always holds as X
has columns in general position. Therefore, when $k_0 = 0$, Theorem 3 imposes the same conditions
as Theorem 2, and gives essentially the same result; we say "essentially" here because the former
gives a conservative exponential limit for $T_1$, while the latter gives an exact exponential limit.

Remark 2. If X is orthogonal, then for any $A_0$, conditions (40) and (41)-(44) are trivially satisfied
[for the latter set of conditions, we can take, e.g., $S = \{1, \ldots p\} \setminus (A_0 \cup \{j\})$]. With an additional
condition on the strength of the true nonzero coefficients, we can assure that $P(B) \to 1$ as $p \to \infty$
with $A_0 = A^*$, $s_{A_0} = \mathrm{sign}(\beta^*_{A_0})$, and $k_0 = |A_0|$, and hence prove a conservative exponential limit for
$T_k$; note that this is precisely what is done in Theorem 1 (except that in this case, the exponential
limit is proven to be exact).

Remark 3. Defining $U_i = X_i^T(I - P_{A_0})y$ for $i \notin A_0$, the condition (41) is a lower bound on the ratio of
the conditional variance of $U_i$ on $U_\ell$, $\ell \in S \setminus \{i\}$, to the unconditional variance of $U_i$. Loosely speaking,
conditions (41), (42), and (43) can all be interpreted as requiring, for any $j \notin A_0$, the existence of
a subset S not containing j (and disjoint from $A_0$) such that the variables $U_i$, $i \in S$ are not too
correlated. This subset has to be large in size compared to $\log p$, by (44). Similar to the discussion
in the remark following Theorem 2, it suffices to find two disjoint subsets in total, $S_1, S_2$, satisfying
(41)-(44), because for any j, at least one of $S_1, S_2$ will not contain j. Also, as argued in this same
remark, the conditions (41)-(44) imply that $n/\log p \to \infty$.

Remark 4. Some readers will likely recognize condition (40) as that of mutual incoherence or strong
irrepresentability, commonly used in the lasso literature on exact support recovery [see, e.g.,
Wainwright (2009), Zhao & Yu (2006)]. This condition, in addition to a lower bound on the magnitudes
of the true coefficients, is sufficient for the lasso solution to recover the true active set $A^*$ with
probability tending to one, at a carefully chosen value of $\lambda$. It is important to point out that we do
not place any requirements on the magnitudes of the true nonzero coefficients; instead, we assume
directly that the lasso converges (with probability approaching one) to some fixed model defined by
$A_0, s_{A_0}$ at the $(k_0)$th step in the path. Here $A_0$ is large enough that it contains the true support,
$A_0 \supseteq A^*$, and the signs $s_{A_0}$ are arbitrary: they may or may not match the signs of the true
coefficients over $A_0$. In a setting in which the nonzero coefficients in $\beta^*$ are well-separated from zero, a
condition quite similar to the irrepresentable condition can be used to show that the lasso converges
to the model with support $A_0 = A^*$ and signs $s_{A_0} = \mathrm{sign}(\beta^*_{A_0})$, at step $k_0 = |A_0|$ of the path. Our
result extends beyond this case, and allows for situations in which the lasso model converges to a
possibly larger set of "screened" variables $A_0$, and fixed signs $s_{A_0}$.

Remark 5. In fact, one can modify the above arguments to account for the case that $A_0$ does not
contain the entire set $A^*$ of truly nonzero coefficients, but rather, only the "strong" coefficients.
While "strong" is rather vague, a more precise way of stating this is to assume that $\beta^*$ has nonzero
coefficients both large and small in magnitude, and with $A_0$ corresponding to the set of large
coefficients, we assume that the (left-out) small coefficients must be small enough that the mean of the
process $g$ in (27) (with $A = A_0$ and $s_A = s_{A_0}$) grows much slower than $M^+$. The details, though
not the main ideas, of the arguments would change, and the result would still be a conservative
exponential limit for the covariance statistic $T_k$ at step $k = k_0 + 1$. We will pursue this extension in
future work.

5 Simulation of the null distribution 

We investigate the null distribution of the covariance statistic through simulations, starting with an 
orthogonal predictor matrix X , and then considering more general forms of X . 

5.1 Orthogonal predictor matrix 

Similar to our example from the start of Section 1, we generated $n = 100$ observations with $p = 10$
orthogonal predictors. The true coefficient vector $\beta^*$ contained 3 nonzero components equal to 6,
and the rest zero. The error variance was $\sigma^2 = 1$, so that the truly active predictors had strong
effects and always entered the model first, with both forward stepwise and the lasso. Figure 2 shows
the results for testing the 4th (truly inactive) predictor to enter, averaged over 500 simulations;
the left panel shows the chi-squared test (drop in RSS) applied at the 4th step in forward stepwise
regression, and the right panel shows the covariance test applied at the 4th step of the lasso path.
We see that the Exp(1) distribution provides a good finite-sample approximation for the distribution
of the covariance statistic, while $\chi^2_1$ is a poor approximation for the drop in RSS.
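
The contrast described above can be sketched in a short simulation. This is not the authors' code; it is a minimal illustration that assumes an orthonormal design, so that $U = X^T y \sim N(\beta^*, \sigma^2 I)$ can be drawn directly, and it uses the orthogonal-case form of the statistic, $T_k = |U_{(k)}|(|U_{(k)}| - |U_{(k+1)}|)/\sigma^2$, together with the fact that the drop in RSS at step $k$ of forward stepwise is $U_{(k)}^2$ in this setting. The function name and the number of replications are illustrative choices.

```python
import math
import numpy as np

def step4_statistics(n_sim=20000, p=10, n_signal=3, signal=6.0, seed=0):
    """Simulate U = X^T y for an orthonormal X (sigma^2 = 1) and return the
    drop in RSS and the covariance statistic at the 4th step."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(p)
    beta[:n_signal] = signal                      # 3 strong signals, rest null
    U = beta + rng.standard_normal((n_sim, p))    # U ~ N(beta, I)
    abs_sorted = -np.sort(-np.abs(U), axis=1)     # |U_(1)| >= |U_(2)| >= ...
    k = n_signal                                  # 0-based index of the 4th entry
    rss_drop = abs_sorted[:, k] ** 2
    cov_stat = abs_sorted[:, k] * (abs_sorted[:, k] - abs_sorted[:, k + 1])
    return rss_drop, cov_stat

rss_drop, cov_stat = step4_statistics()
chi2_cut = 3.841458820694124      # 95% quantile of chi-squared with 1 df
exp_cut = -math.log(0.05)         # 95% quantile of Exp(1)
rss_rate = float(np.mean(rss_drop > chi2_cut))
cov_rate = float(np.mean(cov_stat > exp_cut))
print(f"type I error, chi-squared test on drop in RSS: {rss_rate:.3f}")
print(f"type I error, Exp(1) test on covariance stat:  {cov_rate:.3f}")
```

With signals this strong the three active predictors occupy the first three steps in essentially every replication, so the sketch does not bother to discard the rare exceptions.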

Figure 3 shows the results for testing the 5th, 6th, and 7th predictors to enter the lasso model.
An Exp(1)-based test will now be conservative: at a nominal 5% level, the actual type I errors are
about 1%, 0.2%, and 0.0%, respectively. The solid line has slope 1, and the broken lines have slopes
1/2, 1/3, 1/4, as predicted by Theorem 1.

5.2 General predictor matrix 

In Table 2, we simulated null data (i.e., $\beta^* = 0$), and examined the distribution of the covariance
test statistic $T_1$ for the first predictor to enter. We varied the numbers of predictors $p$, correlation
parameter $\rho$, and structure of the predictor correlation matrix. In the first two correlation setups,
the correlation between each pair of predictors was $\rho$, in the data and population, respectively.
In the AR(1) setup, the correlation between predictors $j$ and $j'$ is $\rho^{|j-j'|}$. Finally, in the block
diagonal setup, the correlation matrix has two equal sized blocks, with population correlation $\rho$ in
each block. We computed the mean, variance, and tail probability of the covariance statistic $T_1$ over
500 simulated data sets for each setup. We see that the Exp(1) distribution is a reasonably good
approximation throughout.
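
One cell of this experiment can be sketched as follows. The sketch is illustrative rather than the authors' code, and it rests on assumptions not derived in this section: for unit-norm columns, the first knot is $\lambda_1 = \max_j |X_j^T y|$, attained at $j_1$ with sign $s_1$; each other predictor $j$ would join at $X_j^T (I - P_{j_1}) y / (s - s_1 X_j^T X_{j_1})$ for $s = \pm 1$, the largest such value being $\lambda_2$; and $T_1 = \lambda_1(\lambda_1 - \lambda_2)/\sigma^2$. The function names, the AR(1) parameter 0.4, and the number of replications are hypothetical choices.

```python
import numpy as np

def first_two_knots(X, y):
    """First two knots of the lasso path, assuming unit-norm columns of X."""
    U = X.T @ y
    j1 = int(np.argmax(np.abs(U)))
    lam1 = abs(U[j1])
    s1 = np.sign(U[j1])
    r = X.T @ X[:, j1]                 # correlations with the first predictor
    a = U - r * U[j1]                  # X_j^T (I - P_{j1}) y
    keep = np.arange(X.shape[1]) != j1
    join_pos = a[keep] / (1.0 - s1 * r[keep])    # candidate knots, sign +1
    join_neg = -a[keep] / (1.0 + s1 * r[keep])   # candidate knots, sign -1
    lam2 = max(join_pos.max(), join_neg.max())
    return lam1, lam2

def covariance_stat_first_step(X, y, sigma2=1.0):
    lam1, lam2 = first_two_knots(X, y)
    return lam1 * (lam1 - lam2) / sigma2

def simulate_null_T1(n=100, p=10, rho=0.4, n_sim=2000, seed=0):
    """Global null, AR(1) population correlation rho^|j - j'|."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    L = np.linalg.cholesky(Sigma)
    T = np.empty(n_sim)
    for b in range(n_sim):
        X = rng.standard_normal((n, p)) @ L.T
        X /= np.linalg.norm(X, axis=0)     # unit-norm columns
        y = rng.standard_normal(n)         # beta* = 0, sigma^2 = 1
        T[b] = covariance_stat_first_step(X, y)
    return T

T1 = simulate_null_T1()
tail = float(np.mean(T1 > -np.log(0.05)))
print(f"mean {T1.mean():.3f}, variance {T1.var():.3f}, tail prob {tail:.3f}")
```

Under the Exp(1) approximation the three printed summaries should be close to 1, 1, and 0.05, which is the comparison the table reports.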

In Table 3, the setup was the same as in Table 2 except that we set the first $k$ coefficients of the
true coefficient vector equal to 4, and the rest zero, for $k = 1, 2, 3$. The dimensions were also fixed
at $n = 100$ and $p = 50$. We computed the mean, variance, and tail probability of the covariance
statistic $T_{k+1}$ for entering the next (truly inactive) $(k+1)$st predictor, discarding those simulations
in which a truly inactive predictor was selected in the first $k$ steps. (This occurred 1.7%, 4.0%, and
7.0% of the time, respectively.) Again we see that the Exp(1) approximation is reasonably accurate
throughout.

[Figure 2 panels: (a) Forward stepwise, test statistic quantiles against chi-squared on 1 df; (b) Lasso, test statistic quantiles against Exp(1).]

Figure 2: An example with $n = 100$ and $p = 10$ orthogonal predictors, and the true coefficient vector having
3 nonzero, large components. Shown are quantile-quantile plots for the drop in RSS test applied to forward
stepwise regression at the 4th step and the covariance test for the lasso path at the 4th step.

In Figure 4, we estimate the power curves for significance testing via the drop in RSS test for
forward stepwise regression, and the covariance test for the lasso. In the former we use
simulation-derived cutpoints, and in the latter we use the theoretically-based Exp(1) cutpoints, to control the
type I error at the 5% level. We find that the tests have similar power, though the cutpoints for
forward stepwise would not be typically available in practice. For more details see the figure caption.



6 The case of unknown $\sigma^2$

Up until now we have assumed that the error variance $\sigma^2$ is known; in practice it will typically be
unknown. In this case, provided that $n > p$, we can easily estimate it and proceed by analogy to
standard linear model theory. In particular, we can estimate $\sigma^2$ by the mean squared residual error
$\hat\sigma^2 = \|y - X\hat\beta^{LS}\|_2^2 / (n - p)$, with $\hat\beta^{LS}$ being the regression coefficients from $y$ on $X$ (i.e., the full
model). Plugging this estimate into the covariance statistic in (5) yields a new statistic $F_k$, that has
an asymptotic F-distribution under the null:

$$F_k = \frac{\langle y, X\hat\beta(\lambda_{k+1})\rangle - \langle y, X_A\tilde\beta_A(\lambda_{k+1})\rangle}{\hat\sigma^2} \overset{d}{\to} F_{2,n-p}. \qquad (45)$$

This follows because $F_k = T_k / (\hat\sigma^2 / \sigma^2)$, the numerator $T_k$ being asymptotically Exp(1) $= \chi^2_2 / 2$, the
denominator $\hat\sigma^2 / \sigma^2$ being asymptotically $\chi^2_{n-p} / (n - p)$, and we claim that the two are independent.
Why? Note that the lasso solution path is unchanged if we replace $y$ by $P_X y$, so the lasso fitted
values in $T_k$ are functions of $P_X y$; meanwhile, $\hat\sigma^2$ is a function of $(I - P_X) y$. The quantities $P_X y$
and $(I - P_X) y$ are uncorrelated and hence independent (recalling normality of $y$), so $T_k$ and $\hat\sigma^2$ are
functions of independent quantities, and therefore independent.

[Figure 3 panels, left to right: 5th predictor, 6th predictor, 7th predictor; test statistic quantiles against Exp(1) in each.]

Figure 3: The same setup as in Figure 2, but here we show the covariance test at the 5th, 6th, and 7th steps
along the lasso path, from left to right, respectively. The solid line has slope 1, while the broken lines have
slopes 1/2, 1/3, 1/4, as predicted by Theorem 1.
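
The representation of $F_{2,n-p}$ as an Exp(1) variable divided by an independent $\chi^2_{n-p}/(n-p)$ variable, used in the independence argument of this section, gives the reference distribution a closed-form tail, $P(F_{2,m} > x) = (1 + 2x/m)^{-m/2}$. The sketch below is a small worked calculation under that standard identity; the helper names are illustrative, and $m = 20$ corresponds to the setting $n = 100$, $p = 80$ used in Table 4.

```python
import math

def f2_tail(x, m):
    """P(F_{2,m} > x), from F_{2,m} = Exp(1) / (chi2_m / m)."""
    return (1.0 + 2.0 * x / m) ** (-m / 2.0)

def f2_quantile(prob, m):
    """The prob-quantile of F_{2,m}, obtained by inverting f2_tail."""
    return (m / 2.0) * ((1.0 - prob) ** (-2.0 / m) - 1.0)

def exp1_quantile(prob):
    """The prob-quantile of Exp(1)."""
    return -math.log(1.0 - prob)

m = 100 - 80                       # n - p
q_f = f2_quantile(0.95, m)         # cutoff when sigma^2 is estimated
q_e = exp1_quantile(0.95)          # cutoff when sigma^2 is known
mean_f = m / (m - 2)               # mean of F_{2,m}, valid for m > 2
print(f"95% quantile: F_2,{m} = {q_f:.3f}, Exp(1) = {q_e:.3f}")
print(f"mean of F_2,{m} = {mean_f:.3f}")
```

The larger $F_{2,n-p}$ cutoff is what corrects the anti-conservative behavior of the Exp(1) cutoff when $n - p$ is small; as $m$ grows the two quantiles coincide.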



As an example, consider one of the setups from Table 2, with $n = 100$, $p = 80$, and predictor
correlation of the AR(1) form $\rho^{|j-j'|}$. The true model is null, and we test the first predictor to enter
along the lasso path. (We choose $n, p$ of roughly equal sizes here to expose the differences between
the $\sigma^2$ known and unknown cases.) Table 4 shows the results of 1000 simulations from each of the
$\rho = 0$ and $\rho = 0.8$ scenarios. We see that with $\sigma^2$ estimated, the $F_{2,n-p}$ distribution provides a more
accurate finite-sample approximation than does Exp(1).
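
A rough version of the $\rho = 0$ scenario can be sketched as follows. It is not the authors' code: it assumes i.i.d. Gaussian predictors scaled to unit norm, and it computes the first-step statistic from the first two knots of the lasso path using closed forms that are assumptions of this sketch (they are not derived in this section). All function names are hypothetical.

```python
import math
import numpy as np

def lam1_times_gap(X, y):
    """lambda_1 * (lambda_1 - lambda_2), assuming unit-norm columns of X."""
    U = X.T @ y
    j1 = int(np.argmax(np.abs(U)))
    lam1, s1 = abs(U[j1]), np.sign(U[j1])
    r = X.T @ X[:, j1]
    a = U - r * U[j1]                            # X_j^T (I - P_{j1}) y
    keep = np.arange(X.shape[1]) != j1
    lam2 = max((a[keep] / (1.0 - s1 * r[keep])).max(),
               (-a[keep] / (1.0 + s1 * r[keep])).max())
    return lam1 * (lam1 - lam2)

def simulate_F1(n=100, p=80, n_sim=1000, seed=0):
    """Null distribution of F_1 with sigma^2 estimated from the full model."""
    rng = np.random.default_rng(seed)
    F = np.empty(n_sim)
    for b in range(n_sim):
        X = rng.standard_normal((n, p))
        X /= np.linalg.norm(X, axis=0)
        y = rng.standard_normal(n)               # global null, sigma^2 = 1
        beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
        sigma2_hat = np.sum((y - X @ beta_ls) ** 2) / (n - p)
        F[b] = lam1_times_gap(X, y) / sigma2_hat
    return F

F1 = simulate_F1()
m = 100 - 80
exp_cut = -math.log(0.05)                        # Exp(1) 95% quantile
f_cut = (m / 2.0) * (0.05 ** (-2.0 / m) - 1.0)   # F_{2,m} 95% quantile
tail_exp = float(np.mean(F1 > exp_cut))
tail_f = float(np.mean(F1 > f_cut))
print(f"tail prob vs Exp(1) cutoff: {tail_exp:.3f}")
print(f"tail prob vs F_2,{m} cutoff: {tail_f:.3f}")
```

If the approximation in (45) is adequate, the second tail probability should sit nearer the nominal 0.05 than the first.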

When $p > n$, estimation of $\sigma^2$ is not nearly as straightforward; one idea is to estimate $\sigma^2$ from
the least squares fit on the support of the model selected by cross-validation. One would then hope
that the resulting statistic, with this plug-in estimate of $\sigma^2$, is close in distribution to $F_{2,n-r}$ under
the null, where $r$ is the size of the model chosen by cross-validation. This is by analogy to the
low-dimensional $n > p$ case in (45), but is not supported by rigorous theory. Simulations (withheld
for brevity) show that this approximation is not too far off, but that the variance of the observed
statistic is sometimes inflated compared to that of an $F_{2,n-r}$ distribution (this unaccounted variability
is likely due to the model selection process via cross-validation). Other authors have argued that
using cross-validation to estimate $\sigma^2$ is not necessarily a good approach, as it can be anti-conservative
when $p \gg n$; see, e.g., Fan et al. (2012). In future work, we will address the important issue of
estimating $\sigma^2$ in the context of the covariance statistic, when $p > n$.

7 Real data examples 

We demonstrate the use of the covariance test with some real data examples. As mentioned previously,
in any serious application of significance testing over many variables (many steps of the lasso path),
we would need to consider the issue of multiple comparisons, which we do not here. This is a topic
for future work.

7.1 Wine data 

Table 5 shows the results for the wine quality data taken from the UCI database. There are $p = 11$
predictors, and $n = 1599$ observations, which we split randomly into approximately equal-sized
training and test sets. The outcome is a wine quality rating, on a scale between 0 and 10. The table
shows the training set p-values from forward stepwise regression (with the chi-squared test) and the
lasso (with the covariance test). Forward stepwise enters 6 predictors at the 0.05 level, while the
lasso enters only 3.

n = 100, p = 10
        Equal data corr          Equal pop'n corr         AR(1)                    Block diagonal
ρ       Mean   Var    Tail pr    Mean   Var    Tail pr    Mean   Var    Tail pr    Mean   Var    Tail pr
0       0.966  1.157  0.062      1.120  1.951  0.090      1.017  1.484  0.070      1.058  1.548  0.060
0.2     0.972  1.178  0.066      1.119  1.844  0.086      1.034  1.497  0.074      1.069  1.614  0.078
0.4     0.963  1.219  0.060      1.115  1.724  0.092      1.045  1.469  0.060      1.077  1.701  0.076
0.6     0.960  1.265  0.070      1.095  1.648  0.086      1.048  1.485  0.066      1.074  1.719  0.086
0.8     0.958  1.367  0.060      1.062  1.624  0.092      1.034  1.471  0.062      1.062  1.687  0.072
se      0.007  0.015  0.001      0.010  0.049  0.001      0.013  0.043  0.001      0.010  0.047  0.001

n = 100, p = 50
        Equal data corr          Equal pop'n corr         AR(1)                    Block diagonal
ρ       Mean   Var    Tail pr    Mean   Var    Tail pr    Mean   Var    Tail pr    Mean   Var    Tail pr
0       0.929  1.058  0.048      1.078  1.721  0.074      1.039  1.415  0.070      0.999  1.578  0.048
0.2     0.920  1.032  0.038      1.090  1.476  0.074      0.998  1.391  0.054      1.064  2.062  0.052
0.4     0.928  1.033  0.040      1.079  1.382  0.068      0.985  1.373  0.060      1.076  2.168  0.062
0.6     0.950  1.058  0.050      1.057  1.312  0.060      0.978  1.425  0.054      1.060  2.138  0.060
0.8     0.982  1.157  0.056      1.035  1.346  0.056      0.973  1.439  0.060      1.046  2.066  0.068
se      0.010  0.030  0.001      0.011  0.037  0.001      0.009  0.041  0.001      0.011  0.103  0.001

n = 100, p = 200
        Equal data corr          Equal pop'n corr         AR(1)                    Block diagonal
ρ       Mean   Var    Tail pr    Mean   Var    Tail pr    Mean   Var    Tail pr    Mean   Var    Tail pr
0       -      -      -          1.004  1.017  0.054      1.029  1.240  0.062      0.930  1.166  0.042
0.2     -      -      -          0.996  1.164  0.052      1.000  1.182  0.062      0.927  1.185  0.046
0.4     -      -      -          1.003  1.262  0.058      0.984  1.016  0.058      0.935  1.193  0.048
0.6     -      -      -          1.007  1.327  0.062      0.954  1.000  0.050      0.915  1.231  0.044
0.8     -      -      -          0.989  1.264  0.066      0.961  1.135  0.060      0.914  1.258  0.056
se      -      -      -          0.008  0.039  0.001      0.009  0.028  0.001      0.007  0.032  0.001

Table 2: Simulation results for the first predictor to enter for a global null true model. We vary the number
of predictors $p$, correlation parameter $\rho$, and structure of the predictor correlation matrix. Shown are the
mean, variance, and tail probability $P(T_1 > q_{0.95})$ of the covariance statistic $T_1$, where $q_{0.95}$ is the 95% quantile
of the Exp(1) distribution, computed over 500 simulated data sets for each setup. Standard errors are given
by "se". (The panel in the bottom left corner is missing because the equal data correlation setup is not defined
for $p > n$.)

In the left panel of Figure 5, we repeated this p-value computation over 500 random splits into
training and test sets. The right panel shows the corresponding test set prediction error for the models of
each size. The lasso test error decreases sharply once the 3rd predictor is added, but then somewhat
flattens out from the 4th predictor onwards; this is in general qualitative agreement with the lasso
p-values in the left panel, the first 3 being very small, and the 4th p-value being about 0.2. This
also echoes the well-known difference between hypothesis testing and minimizing prediction error.
For example, the $C_p$ statistic stops entering variables when the p-value is larger than about 0.16.
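
The 0.16 figure follows from a short calculation, sketched below under the standard assumption that $C_p$ charges $2\sigma^2$ for each added parameter. A variable then improves $C_p$ only if its drop in RSS exceeds $2\sigma^2$, so the implied p-value cutoff for a (non-adaptive) chi-squared test on 1 degree of freedom is $P(\chi^2_1 > 2)$. The helper name is illustrative.

```python
import math

def chi2_1_tail(x):
    """P(chi-squared_1 > x) = P(|Z| > sqrt(x)) for a standard normal Z."""
    return math.erfc(math.sqrt(x / 2.0))

cp_threshold = 2.0                     # drop in RSS / sigma^2 needed to lower C_p
implied_p = chi2_1_tail(cp_threshold)
print(round(implied_p, 3))             # prints 0.157
```

So a $C_p$-based stopping rule behaves like a test at roughly the 16% level, much more liberal than the conventional 5%.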

7.2 HIV data 

Rhee et al. (2003) study six nucleotide reverse transcriptase inhibitors (NRTIs) that are used to
treat HIV-1. The target of these drugs can become resistant through mutation, and they compare a
collection of models for predicting the (log) susceptibility of the drugs, a measure of drug resistance,
based on the location of mutations. We focused on the first drug (3TC), for which there are $p = 217$
sites and $n = 1057$ samples. To examine the behavior of the covariance test in the $p > n$ setting, we
divided the data at random into training and test sets of size 150 and 907, respectively, a total of
50 times. Figure 6 shows the results, in the same format as Figure 5. We used the model chosen by
cross-validation to estimate $\sigma^2$. The covariance test for the lasso suggests that there are only one
or two important predictors (in marked contrast to the chi-squared test for forward stepwise), and
this is confirmed by the test error plot in the right panel.

k = 1 and 2nd predictor to enter
        Equal data corr          Equal pop'n corr         AR(1)                    Block diagonal
ρ       Mean   Var    Tail pr    Mean   Var    Tail pr    Mean   Var    Tail pr    Mean   Var    Tail pr
0       (row incomplete in this copy; surviving entries: 0.933, 0.048, 0.060)
0.2     0.940  1.051  0.046      1.039  1.554  0.082      1.017  1.175  0.060      1.062  2.015  0.062
0.4     0.952  1.126  0.056      1.016  1.548  0.084      0.984  1.230  0.056      1.042  2.137  0.066
0.6     0.938  1.129  0.064      0.997  1.518  0.079      0.964  1.247  0.056      1.018  1.798  0.068
0.8     0.818  0.945  0.039      0.815  0.958  0.044      0.914  1.172  0.062      0.822  0.966  0.037
se      0.010  0.024  0.002      0.011  0.036  0.002      0.010  0.030  0.002      0.015  0.087  0.002

k = 2 and 3rd predictor to enter
        Equal data corr          Equal pop'n corr         AR(1)                    Block diagonal
ρ       Mean   Var    Tail pr    Mean   Var    Tail pr    Mean   Var    Tail pr    Mean   Var    Tail pr
0       0.927  1.051  0.046      1.119  1.724  0.094      0.996  1.108  0.072      1.072  1.800  0.064
0.2     0.928  1.088  0.044      1.070  1.590  0.080      0.996  1.113  0.050      1.043  2.029  0.060
0.4     0.918  1.160  0.050      1.042  1.532  0.085      1.008  1.198  0.058      1.024  2.125  0.066
0.6     0.897  1.104  0.048      0.994  1.371  0.077      1.012  1.324  0.058      0.945  1.568  0.054
0.8     0.719  0.633  0.020      0.781  0.929  0.042      1.031  1.324  0.068      0.771  0.823  0.038
se      0.011  0.034  0.002      0.014  0.049  0.003      0.009  0.022  0.002      0.013  0.073  0.002

k = 3 and 4th predictor to enter
        Equal data corr          Equal pop'n corr         AR(1)                    Block diagonal
ρ       Mean   Var    Tail pr    Mean   Var    Tail pr    Mean   Var    Tail pr    Mean   Var    Tail pr
0       0.925  1.021  0.046      1.080  1.571  0.086      1.044  1.225  0.070      1.003  1.604  0.060
0.2     0.926  1.159  0.050      1.031  1.463  0.069      1.025  1.189  0.056      1.010  1.991  0.060
0.4     0.922  1.215  0.048      0.987  1.351  0.069      0.980  1.185  0.050      0.918  1.576  0.053
0.6     0.905  1.158  0.048      0.888  1.159  0.053      0.947  1.189  0.042      0.837  1.139  0.052
0.8     0.648  0.503  0.008      0.673  0.699  0.026      0.940  1.244  0.062      0.647  0.593  0.015
se      0.014  0.037  0.002      0.016  0.044  0.003      0.014  0.031  0.003      0.016  0.073  0.002

Table 3: Simulation results for the $(k+1)$st predictor to enter for a model with $k$ truly nonzero coefficients,
across $k = 1, 2, 3$. The rest of the setup is the same as in Table 2 except that the dimensions were fixed at
$n = 100$ and $p = 50$.



8 Extensions 

We discuss some extensions of the covariance statistic, beyond significance testing for the lasso. The
proposals here are supported by simulations [in terms of having an Exp(1) null distribution], but we
do not offer any theory. This may be a direction for future work.

8.1 The elastic net 

The elastic net estimate (Zou & Hastie 2005) is defined as

$$\hat\beta^{en} = \underset{\beta \in \mathbb{R}^p}{\mathrm{argmin}} \; \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1 + \frac{\gamma}{2}\|\beta\|_2^2, \qquad (46)$$

where $\gamma > 0$ is a second tuning parameter. It is not hard to see that this can actually be cast as a
lasso estimate with predictor matrix $\tilde{X} = \begin{bmatrix} X \\ \sqrt{\gamma} I \end{bmatrix} \in \mathbb{R}^{(n+p) \times p}$ and outcome $\tilde{y} = (y, 0) \in \mathbb{R}^{n+p}$.
This shows that, for a fixed $\gamma$, the elastic net solution path is piecewise linear over $\lambda$, with each knot
marking the entry (or deletion) of a variable from the active set. We therefore define the covariance
statistic in the same manner as we did for the lasso; fixing $\gamma$, to test the predictor entering at the
$k$th step (knot $\lambda_k$) in the elastic net path, we consider the statistic

$$T_k = \big(\langle y, X\hat\beta^{en}(\lambda_{k+1}, \gamma)\rangle - \langle y, X_A\tilde\beta_A^{en}(\lambda_{k+1}, \gamma)\rangle\big)/\sigma^2,$$

where as before, $\lambda_{k+1}$ is the next knot in the path, $A$ is the active set of predictors just before $\lambda_k$, and
$\tilde\beta_A^{en}$ is the elastic net solution using only the predictors $X_A$. The precise expression for the elastic
net solution in (46), for a given active set and signs, is the same as it is for the lasso (see Section
2.3), but with $(X_A^T X_A)^{-1}$ replaced by $(X_A^T X_A + \gamma I)^{-1}$. This generally creates a complication for
the theory in Sections 3 and 4. But in the orthogonal $X$ case, we have $(X_A^T X_A + \gamma I)^{-1} = I/(1+\gamma)$
and so

$$T_k = \frac{1}{1+\gamma} \cdot |U_{(k)}|\big(|U_{(k)}| - |U_{(k+1)}|\big)/\sigma^2,$$

with $U_j = X_j^T y$, $j = 1, \ldots, p$. This means that for an orthogonal $X$, under the null,

$$(1+\gamma) \cdot T_k \overset{d}{\to} \mathrm{Exp}(1),$$

and one is tempted to use this approximation beyond the orthogonal setting as well. In Figure 7, we
evaluated the distribution of $(1+\gamma) T_1$ (for the first predictor to enter), for orthogonal and correlated
scenarios, and for three different values of $\gamma$. Here $n = 100$, $p = 10$, and the true model was null. It
seems to be reasonably close to Exp(1) in all cases.

[Figure 4 panels: left, 1st predictor to enter; right, 4th predictor to enter; power plotted against effect size in each.]

Figure 4: Estimated power curves for significance tests using forward stepwise regression and the drop in
RSS statistic, as well as the lasso and the covariance statistic. The results are averaged over 1000 simulations
with $n = 100$ and $p = 10$ predictors drawn i.i.d. from $N(0, 1)$, and $\sigma^2 = 1$. On the left, there is one truly
nonzero regression coefficient, and we varied its magnitude (the effect size parameter on the x-axis). We
examined the first step of the forward stepwise and lasso procedures. On the right, in addition to a nonzero
coefficient with varying effect size (on the x-axis), there are 3 additional large coefficients in the true model.
We examined the 4th step in forward stepwise and the lasso, after the 3 strong variables have been entered.
For the power curves in both panels, we use simulation-based cutpoints for forward stepwise to control the
type I error at the 5% level; for the lasso we do the same, but also display the results for the theoretically-based
Exp(1) cutpoint. Note that in practice, simulation-based cutpoints would not typically be available.
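
The recasting of the elastic net as a lasso problem on augmented data, and the orthogonal-case identity $(X_A^T X_A + \gamma I)^{-1} = I/(1+\gamma)$, can both be checked numerically. The sketch below is illustrative only: it assumes the criterion $\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1 + \frac{\gamma}{2}\|\beta\|_2^2$ and compares objective values at an arbitrary $\beta$ rather than solving either problem. Function names and the random test data are hypothetical.

```python
import numpy as np

def elastic_net_objective(beta, X, y, lam, gamma):
    """0.5 * ||y - X b||^2 + lam * ||b||_1 + 0.5 * gamma * ||b||^2."""
    return (0.5 * np.sum((y - X @ beta) ** 2)
            + lam * np.sum(np.abs(beta))
            + 0.5 * gamma * np.sum(beta ** 2))

def lasso_objective(beta, X, y, lam):
    """0.5 * ||y - X b||^2 + lam * ||b||_1."""
    return 0.5 * np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))

def augment(X, y, gamma):
    """Stack sqrt(gamma) * I under X and p zeros under y."""
    p = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(gamma) * np.eye(p)])
    y_aug = np.concatenate([y, np.zeros(p)])
    return X_aug, y_aug

rng = np.random.default_rng(0)
n, p, lam, gamma = 20, 5, 1.3, 0.7
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta = rng.standard_normal(p)

X_aug, y_aug = augment(X, y, gamma)
en_value = elastic_net_objective(beta, X, y, lam, gamma)
lasso_value = lasso_objective(beta, X_aug, y_aug, lam)
print(np.isclose(en_value, lasso_value))

# Orthonormal columns: the ridge-adjusted inverse is a multiple of the identity.
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
ridge_inverse = np.linalg.inv(Q.T @ Q + gamma * np.eye(p))
print(np.allclose(ridge_inverse, np.eye(p) / (1.0 + gamma)))
```

Because the two objectives agree at every $\beta$, any lasso path algorithm applied to the augmented data traces out the elastic net path for that fixed $\gamma$.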

8.2 Generalized linear models and the Cox model 

Consider the estimate from an $\ell_1$-penalized generalized linear model:

$$\hat\beta^{glm} = \underset{\beta \in \mathbb{R}^p}{\mathrm{argmin}} \; -\sum_{i=1}^n \log f(y_i; x_i, \beta) + \lambda\|\beta\|_1, \qquad (47)$$

where $f(y_i; x_i, \beta)$ is an exponential family density, a function of the predictor measurements $x_i \in \mathbb{R}^p$
and parameter $\beta \in \mathbb{R}^p$. Note that the usual lasso estimate is a special case of this form when
$f$ is the Gaussian density with known variance $\sigma^2$. The natural parameter in (47) is $\eta_i = x_i^T \beta$, for
$i = 1, \ldots, n$, related to the mean of $y_i$ via a link function $g(E[y_i | x_i]) = \eta_i$.

ρ = 0
              Mean   Variance   95% quantile   Tail prob
Observed      1.17   2.10       3.75           -
Exp(1)        1.00   1.00       2.99           0.082
F_{2,n-p}     1.11   1.54       3.49           0.054

ρ = 0.8
              Mean   Variance   95% quantile   Tail prob
Observed      1.14   1.70       3.77           -
Exp(1)        1.00   1.00       2.99           0.097
F_{2,n-p}     1.11   1.54       3.49           0.064

Table 4: Comparison of Exp(1), $F_{2,n-p}$, and the observed (empirical) null distribution of the covariance
statistic, when $\sigma^2$ has been estimated. We examined 1000 simulated data sets with $n = 100$, $p = 80$, and
the correlation between predictors $j$ and $j'$ equal to $\rho^{|j-j'|}$. We are testing the first step of the lasso path,
and the true model is the global null. Results are shown for $\rho = 0.0$ and 0.8. The last column shows the
tail probability $P(T_1 > q_{0.95})$ computed over the 1000 simulations, where $q_{0.95}$ is the 95% quantile from the
appropriate distribution (either Exp(1) or $F_{2,n-p}$).

Forward stepwise                                   Lasso
Step  Predictor             RSS test  p-value      Step  Predictor             Cov test  p-value
1     alcohol               315.216   0.000        1     alcohol               79.388    0.000
2     volatile_acidity      137.412   0.000        2     volatile_acidity      77.956    0.000
3     sulphates             18.571    0.000        3     sulphates             10.085    0.000
4     chlorides             10.607    0.001        4     chlorides             1.757     0.173
5     pH                    4.400     0.036        5     total_sulfur_dioxide  0.622     0.537
6     total_sulfur_dioxide  3.392     0.066        6     pH                    2.590     0.076
7     residual_sugar        0.607     0.436        7     residual_sugar        0.318     0.728
8     citric_acid           0.878     0.349        8     citric_acid           0.516     0.597
9     density               0.288     0.592        9     density               0.184     0.832
10    fixed_acidity         0.116     0.733        10    free_sulfur_dioxide   0.000     1.000
11    free_sulfur_dioxide   0.000     0.997        11    fixed_acidity         0.114     0.892

Table 5: Wine data: forward stepwise and lasso p-values. The values are rounded to 3 decimal places. For
the lasso, we only show p-values for the steps in which a predictor entered the model and stayed in the model
for the remainder of the path (i.e., if a predictor entered the model at a step but then later left, we do not
show this step; we only show the step corresponding to its last entry point).
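
The p-values in Table 5 can be approximately reproduced from the reported statistics. The sketch below is a hedged reconstruction, not the authors' procedure: it assumes the forward stepwise p-values come from a $\chi^2_1$ reference distribution, that the covariance test p-values come from either Exp(1) or $F_{2,m}$, and that the training set has about 800 observations (half of $n = 1599$), so $m \approx 800 - 11$. At this sample size the two covariance-test references nearly coincide. All names are illustrative.

```python
import math

def chi2_1_pvalue(rss_stat):
    """P(chi-squared_1 > rss_stat)."""
    return math.erfc(math.sqrt(rss_stat / 2.0))

def exp1_pvalue(cov_stat):
    """P(Exp(1) > cov_stat)."""
    return math.exp(-cov_stat)

def f2_pvalue(cov_stat, m):
    """P(F_{2,m} > cov_stat), using the closed-form tail of F_{2,m}."""
    return (1.0 + 2.0 * cov_stat / m) ** (-m / 2.0)

n_train, p = 800, 11          # assumed training size; p from the text
m = n_train - p

# (step, predictor, RSS test, covariance test) for steps where both
# procedures entered the same predictor, copied from Table 5
rows = [
    (4, "chlorides", 10.607, 1.757),
    (7, "residual_sugar", 0.607, 0.318),
    (8, "citric_acid", 0.878, 0.516),
    (9, "density", 0.288, 0.184),
]
for step, name, rss_stat, cov_stat in rows:
    print(f"step {step} {name}: "
          f"RSS p = {chi2_1_pvalue(rss_stat):.3f}, "
          f"cov p (Exp) = {exp1_pvalue(cov_stat):.3f}, "
          f"cov p (F) = {f2_pvalue(cov_stat, m):.3f}")
```

The contrast at step 4 is the point of the table: the same predictor has a p-value near 0.001 under the chi-squared test and near 0.17 under the covariance test.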

Having solved (47) with $\lambda = 0$ (i.e., this is simply maximum likelihood), producing a vector of
fitted values $\hat\eta = X\hat\beta^{glm} \in \mathbb{R}^n$, we might define degrees of freedom as$^6$

$$\mathrm{df}(\hat\eta) = \sum_{i=1}^n \mathrm{Cov}(y_i, \hat\eta_i). \qquad (48)$$

This is the implicit concept used by Efron (1986) in his definition of the "optimism" of the training
error. The same idea could be used to define degrees of freedom for the penalized estimate in (47)
for any $\lambda > 0$, and this motivates the definition of the covariance statistic, as follows. If the tuning
parameter value $\lambda = \lambda_k$ marks the entry of a new predictor into the active set $A$, then we define the
covariance statistic

$$T_k = \langle y, X\hat\beta^{glm}(\lambda_{k+1})\rangle - \langle y, X_A\tilde\beta_A^{glm}(\lambda_{k+1})\rangle, \qquad (49)$$

where $\lambda_{k+1}$ is the next value of the tuning parameter at which the model changes (a variable enters
or leaves the active set), and $\tilde\beta_A^{glm}$ is the estimate from the penalized generalized linear model (47)
using only predictors in $A$. Unlike in the Gaussian case, the solution path in (47) is not generally
piecewise linear over $\lambda$, and there is not an algorithm to deliver the exact values of $\lambda$ at which
variables enter the model (we still refer to these as knots in the path). However, one can numerically
approximate these knot values; e.g., see Park & Hastie (2007). By analogy to the Gaussian case,
we would hope that $T_k$ has an asymptotic Exp(1) distribution under the null. Though we have not
rigorously investigated this conjecture, simulations seem to support it.

$^6$ Note that in the Gaussian case, this definition is actually $\sigma^2$ times the usual notion of degrees of freedom; hence in
the presence of a scale parameter, we would divide the right-hand side in the definition (48) by this scale parameter,
and we would do the same for the covariance statistic as defined in (49).

As an example, consider the logistic regression model for binary data. Now $\eta_i = \log(\mu_i / (1 - \mu_i))$,
with $\mu_i = P(y_i = 1 | x_i)$. Figure 8 shows the simulation results from comparing the null distribution
of the covariance test statistic in (49) to Exp(1). Here we used the glmpath package in R (Park &
Hastie 2007) to compute an approximate solution path and locations of knots. The null distribution
of the test statistic looks fairly close to Exp(1).

Figure 5: Wine data: the data were randomly divided 500 times into roughly equal-sized training and test
sets. The left panel shows the training set p-values for forward stepwise regression and the lasso. The right
panel shows the test set error for the corresponding models of each size.

For general likelihood-based regression problems, let $\eta = X\beta$ and $\ell(\eta)$ denote the log likelihood.
We can view maximum likelihood estimation as an iteratively weighted least squares procedure using
the outcome variable

$$z(\eta) = \eta + I_\eta^{-1} S_\eta \qquad (50)$$

where $S_\eta = \nabla\ell(\eta)$, and $I_\eta = -\nabla^2\ell(\eta)$. This applies, e.g., to the class of generalized linear models and
Cox's proportional hazards model. For the general $\ell_1$-penalized estimator

$$\hat\beta^{lik} = \underset{\beta \in \mathbb{R}^p}{\mathrm{argmin}} \; -\ell(X\beta) + \lambda\|\beta\|_1, \qquad (51)$$

we can analogously define the covariance test statistic at a knot $\lambda_k$, marking the entry of a predictor
into the active set $A$, as

$$T_k = \big(\langle I_0^{-1/2} S_0, X\hat\beta^{lik}(\lambda_{k+1})\rangle - \langle I_0^{-1/2} S_0, X_A\tilde\beta_A^{lik}(\lambda_{k+1})\rangle\big)/2, \qquad (52)$$

with $\lambda_{k+1}$ being the next knot in the path (at which a variable is added or deleted from the active
set), and $\tilde\beta_A^{lik}$ the solution of the general penalized likelihood problem (51) with predictor matrix
$X_A$. For the binomial model, the statistic (52) reduces to expression (49). In Figure 9, we computed
this statistic for Cox's proportional hazards model, using a similar setup to that in Figure 8. The
Exp(1) approximation for its null distribution looks reasonably accurate.

Figure 6: HIV data: the data were randomly divided 50 times into training and test sets of size 150 and
907, respectively. The left panel shows the training set p-values for forward stepwise regression and the lasso.
The right panel shows the test set error for the corresponding models of each size.
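
The iteratively weighted least squares view in (50) can be made concrete for logistic regression. The sketch below covers only the unpenalized case ($\lambda = 0$), as an illustration of the working response $z(\eta) = \eta + I_\eta^{-1} S_\eta$; it is not an implementation of the penalized path or of the covariance statistic. For the binomial log likelihood, $S_\eta = y - \mu$ and $I_\eta$ is diagonal with entries $\mu_i(1 - \mu_i)$. The function name and the simulated data are hypothetical.

```python
import numpy as np

def irls_logistic(X, y, n_iter=25):
    """Unpenalized logistic regression by iteratively weighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        score = y - mu                  # S_eta: gradient of log likelihood in eta
        info = mu * (1.0 - mu)          # diagonal of I_eta: minus the Hessian in eta
        z = eta + score / info          # working response from (50)
        WX = X * info[:, None]          # rows of X scaled by the weights
        beta = np.linalg.solve(X.T @ WX, WX.T @ z)   # weighted LS of z on X
    return beta

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.standard_normal((n, p))
beta_true = np.array([0.5, -1.0, 0.25])
prob = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
y = (rng.uniform(size=n) < prob).astype(float)

beta_hat = irls_logistic(X, y)
mu_hat = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
score_norm = float(np.max(np.abs(X.T @ (y - mu_hat))))
print(beta_hat, score_norm)
```

Each pass is one Newton step, so at convergence the score equations $X^T(y - \mu) = 0$ hold; a penalized version would replace the weighted least squares solve with a weighted lasso solve.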

9 Discussion 

We proposed a simple covariance statistic for testing the significance of predictor variables as they
enter the active set, along the lasso solution path. We showed that the distribution of this statistic
is asymptotically Exp(1), under the null hypothesis that all truly active predictors are contained in
the current active set. (See Theorems 1, 2, and 3; the conditions required for this convergence result
vary depending on the step $k$ along the path that we are considering, and the covariance structure
of the predictor matrix $X$; the Exp(1) limiting distribution is in some cases a conservative upper
bound under the null.) Such a result accounts for the adaptive nature of the lasso procedure, which
is not true for the usual chi-squared test (or F-test) applied to, e.g., forward stepwise regression. An
R package covTest for computing the covariance test will be made freely available on the CRAN
repository.

We feel that our work has shed light not only on the lasso path (as given by LARS), but also, at
a high level, on forward stepwise regression. Both the lasso and forward stepwise start by entering
the predictor variable most correlated with the outcome (thinking of standardized predictors), but
the two differ in what they do next. Forward stepwise is greedy, and once it enters this first variable,
it proceeds to fit the first coefficient fully, ignoring the effects of other predictors. The lasso, on the
other hand, increases (or decreases) the coefficient of the first variable only as long as its correlation
with the residual is larger than that of the inactive predictors. Subsequent steps follow similarly.
Intuitively, it seems that forward stepwise regression inflates coefficients unfairly, while the lasso
takes more appropriately sized steps. This intuition is confirmed in one sense by looking at degrees
of freedom (recall Section 2.4). The covariance test and its simple asymptotic null distribution reveal
another way in which the step sizes used by the lasso are "just right".

Figure 7: Elastic net: an example with $n = 100$ and $p = 10$, for orthogonal and correlated predictors (having
pairwise population correlation 0.5), and three different values of the ridge penalty parameter $\gamma$.
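
The difference in first-step behavior described in this discussion is easy to see in the orthogonal case. The sketch below is an illustration under the assumption of orthonormal predictors, where the lasso solution is the soft-thresholded $U = X^T y$ and the second knot is $\lambda_2 = |U_{(2)}|$; variable names are illustrative.

```python
import numpy as np

def first_step_comparison(X, y):
    """Coefficient and fitted-value inner product for the first variable,
    under forward stepwise and under the lasso at its second knot.
    Assumes X has orthonormal columns."""
    U = X.T @ y
    order = np.argsort(-np.abs(U))
    j1, j2 = order[0], order[1]
    fs_coef = U[j1]                                   # full least squares fit
    lasso_coef = np.sign(U[j1]) * (abs(U[j1]) - abs(U[j2]))  # shrunken fit
    fs_gain = fs_coef * U[j1]          # <y, fit> = U_(1)^2, the drop in RSS
    lasso_gain = lasso_coef * U[j1]    # |U_(1)| (|U_(1)| - |U_(2)|) = sigma^2 T_1
    return fs_coef, lasso_coef, fs_gain, lasso_gain

rng = np.random.default_rng(0)
n, p = 100, 10
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))      # orthonormal columns
y = rng.standard_normal(n)                            # global null, sigma^2 = 1
fs_coef, lasso_coef, fs_gain, lasso_gain = first_step_comparison(Q, y)
print(f"forward stepwise: coef {fs_coef:.3f}, <y, fit> {fs_gain:.3f}")
print(f"lasso at knot 2:  coef {lasso_coef:.3f}, <y, fit> {lasso_gain:.3f}")
```

Forward stepwise credits the selected variable with the full $U_{(1)}^2$, which is stochastically larger than $\chi^2_1$ because of the maximization, while the lasso credits it only with the part exceeding the runner-up, which is the quantity the covariance test calibrates against Exp(1).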

The problem of assessing significance in an adaptive linear model fit by the lasso is a difficult one, 
and what we have presented here is by no means a complete solution. There are many directions in 
need of future work; it is our hope that the current paper will stimulate interest in such topics, and 
at some point, related work will culminate in a full set of inferential tools for the lasso, and possibly 
for other commonly-used adaptive procedures. We describe some ideas for future projects below. 

• Significance test for generic lasso models. Of particular interest to us is the generic lasso
testing problem: given a lasso model at some fixed value of $\lambda$, how do we carry out a significance
test for any predictor in the active set? We believe that the same covariance statistic can be
used, with some modification to its asymptotic limiting distribution. The full details are not
yet worked out.

• Proper p-values for forward stepwise. Interestingly, a test analogous to the covariance test can
be used in forward stepwise regression, to provide valid p-values for this greedy procedure. We
will follow up on this idea in a future paper.

• Asymptotic convergence in joint distribution. For an orthogonal $X$, we proved that the covariance
statistic across adjacent steps converges jointly to a vector of independent exponentials.
Some clever combination of the test statistics at adjacent steps might then allow us to perform
efficient significance tests between models that differ in size by more than 1 variable, e.g., to
test the significance of the 7-step model versus the 3-step model directly. Another way of
thinking about this (with credit to Jacob Bien): if the covariance statistic is thought of as the
analog of the t-statistic in ordinary least squares, then such a test would be analogous to an
F-test. We have not yet investigated joint weak convergence in the general $X$ case.

Figure 8: Lasso logistic regression: an example with $n = 100$ and $p = 10$ predictors, i.i.d. from $N(0, 1)$. In
the left panel, all true coefficients are zero; on the right, the first coefficient is large, and the rest are zero.
Shown are quantile-quantile plots of the covariance test statistic (at the first and second steps, respectively),
generated over 500 data sets, versus its conjectured asymptotic distribution, Exp(1).

• Other related problems: estimation and control of false discovery rates with the covariance test
p-values; estimation of $\sigma^2$ when $p > n$; power calculations and confidence interval estimation;
asymptotic theory for lasso models with strong and weak signals (large and small true coefficients);
asymptotic theory for the elastic net, generalized linear models, and the Cox model;
higher-order asymptotics, to shed light on the quality of the Exp(1) approximation for small
$p$; convergence of the test statistic under non-normal errors, relying on central limit theorem
results.
A Appendix

A.1 Proof of Lemma 1

By continuity of the lasso solution path at $\lambda_k$,

$$P_A y - \lambda_k (X_A^T)^+ s_A = P_{A\cup\{j\}} y - \lambda_k (X_{A\cup\{j\}}^T)^+ s_{A\cup\{j\}},$$

and therefore

$$(P_{A\cup\{j\}} - P_A)\, y = \lambda_k \big( (X_{A\cup\{j\}}^T)^+ s_{A\cup\{j\}} - (X_A^T)^+ s_A \big). \tag{53}$$

From this, we can obtain two identities: the first is

$$y^T (P_{A\cup\{j\}} - P_A)\, y = \lambda_k^2 \cdot \big\| (X_{A\cup\{j\}}^T)^+ s_{A\cup\{j\}} - (X_A^T)^+ s_A \big\|_2^2, \tag{54}$$

obtained by squaring both sides in (53) (more precisely, taking the inner product of the left-hand
side with itself and the right-hand side with itself), and noting that $(P_{A\cup\{j\}} - P_A)^2 = P_{A\cup\{j\}} - P_A$;
the second is

$$y^T \big( (X_{A\cup\{j\}}^T)^+ s_{A\cup\{j\}} - (X_A^T)^+ s_A \big) = \lambda_k \cdot \big\| (X_{A\cup\{j\}}^T)^+ s_{A\cup\{j\}} - (X_A^T)^+ s_A \big\|_2^2, \tag{55}$$

obtained by taking the inner product of both sides in (53) with $y$, and then using (54). Plugging
(54) and (55) in for the first and second terms in (7), respectively, then gives the result in (9). □

Figure 9: Lasso Cox model estimate: the basic setup is the same as in Figure 8 ($n$, $p$, the distribution of the
predictors $X$, the true coefficient vector: on the left, entirely zero, and on the right, one large coefficient).
Shown are quantile-quantile plots of the covariance test statistic (at the first and second steps, respectively),
generated over 500 data sets, versus the Exp(1) distribution.



A.2 Proof of Lemma 3

Define $W_i = (V_i - a_p)/b_p$ for $i = 1, \ldots k+1$, with $a_p = F^{-1}(1 - 1/p)$ and $b_p = pF'(a_p)$, as in the
proof of Lemma 2. Then Theorem 2.1.1 in de Haan & Ferreira (2006) shows that

$$(W_1, W_2, \ldots W_{k+1}) \overset{d}{\to} \big( -\log E_1,\ -\log(E_1 + E_2),\ \ldots\ -\log(E_1 + E_2 + \ldots + E_{k+1}) \big),$$

where $E_1, \ldots E_{k+1}$ are independent standard exponential variates. By the same arguments as those
given for Lemma 2,

$$V_i (V_i - V_{i+1}) = W_i - W_{i+1} + o_P(1) \quad \text{for all } i = 1, \ldots k+1,$$

and so

$$\big( V_1(V_1 - V_2),\ V_2(V_2 - V_3),\ \ldots,\ V_k(V_k - V_{k+1}) \big) \overset{d}{\to} \Big( \log(E_1 + E_2) - \log E_1,\ \log(E_1 + E_2 + E_3) - \log(E_1 + E_2),\ \ldots\ \log\Big(\sum_{i=1}^{k+1} E_i\Big) - \log\Big(\sum_{i=1}^{k} E_i\Big) \Big).$$

Now let

$$D_i = \frac{E_1 + \ldots + E_i}{E_1 + \ldots + E_{i+1}} \quad \text{for } i = 1, \ldots k,$$

and $D_{k+1} = E_1 + \ldots + E_{k+1}$. A change of variables shows that $D_1, \ldots, D_{k+1}$ are independent. For
each $i = 1, \ldots k$, the variable $D_i$ is distributed as Beta($i$, 1); this is the distribution of the largest
order statistic in a sample of size $i$ from a uniform distribution. It is easily checked that $-\log D_i$
has distribution Exp($1/i$), and the result follows. □
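
The last step of this proof can be checked by simulation. The sketch below is only an illustration, not part of the argument: it draws independent standard exponentials, forms the ratios $D_i$ defined above, and compares the Monte Carlo mean of $-\log D_i$ with $1/i$, reading Exp($1/i$) as the exponential distribution with mean $1/i$ (since $P(-\log D_i > x) = P(D_i < e^{-x}) = e^{-ix}$ for a Beta($i$, 1) variable). The function name, the seed, and the sample size are arbitrary choices made for this example.

```python
import math
import random


def simulate_neg_log_ratio_means(k, n_draws, seed=0):
    """Monte Carlo means of -log(D_i) for i = 1..k, where
    D_i = (E_1 + ... + E_i) / (E_1 + ... + E_{i+1}) and E_1, ..., E_{k+1}
    are i.i.d. standard exponential draws."""
    rng = random.Random(seed)
    totals = [0.0] * k
    for _ in range(n_draws):
        e = [rng.expovariate(1.0) for _ in range(k + 1)]  # E_1, ..., E_{k+1}
        partial = 0.0
        for i in range(1, k + 1):
            partial += e[i - 1]                # partial = E_1 + ... + E_i
            d_i = partial / (partial + e[i])   # e[i] holds E_{i+1}
            totals[i - 1] -= math.log(d_i)
    return [total / n_draws for total in totals]


means = simulate_neg_log_ratio_means(k=3, n_draws=100_000, seed=1)
for i, mean in enumerate(means, start=1):
    print(f"i={i}: simulated mean of -log D_i = {mean:.3f}, target 1/i = {1 / i:.3f}")
```

Matching means is of course only a partial check on the distribution; comparing empirical quantiles with those of an exponential would be a natural next step.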
A.3 Proof of Lemma 4

Note that

$$g(j_1,s_1) > g(j,s) \iff \frac{g(j_1,s_1) - s s_1 R_{j,j_1}\, g(j_1,s_1)}{1 - s s_1 R_{j,j_1}} > \frac{g(j,s) - s s_1 R_{j,j_1}\, g(j_1,s_1)}{1 - s s_1 R_{j,j_1}} \iff g(j_1,s_1) > h^{(j_1,s_1)}(j,s),$$

the first step following as $1 - s s_1 R_{j,j_1} > 0$, and the second step following from the definition of $h^{(j_1,s_1)}$.
Taking the intersection of these statements over all $j, s$ [and using the definition of $M(j_1,s_1)$] gives
the result. □

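A.4 Proof of Lemma 5

By l'Hôpital's rule,

$$\lim_{m\to\infty} \frac{\bar\Phi(u(t,m))}{\bar\Phi(m)} = \lim_{m\to\infty} \frac{\phi(u(t,m))}{\phi(m)} \cdot \frac{\partial u(t,m)}{\partial m},$$

where $\phi$ is the standard normal density and $\bar\Phi = 1 - \Phi$ denotes the standard normal survival function. First note that

$$\frac{\partial u(t,m)}{\partial m} = \frac{1}{2} + \frac{m}{2\sqrt{m^2 + 4t}} \to 1 \quad \text{as } m \to \infty.$$

Also, a straightforward calculation shows

$$\log \phi(u(t,m)) - \log \phi(m) = \Big( 1 - \sqrt{1 + 4t/m^2} \Big) \frac{m^2}{4} - \frac{t}{2} \to -t \quad \text{as } m \to \infty,$$

where in the last step we used the fact that $(1 - \sqrt{1 + 4t/m^2})/(4/m^2) \to -t/2$, again by l'Hôpital's
rule. Therefore $\phi(u(t,m))/\phi(m) \to e^{-t}$, which completes the proof. □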
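
The limit in this lemma is easy to inspect numerically. The following sketch assumes that the ratio is one of standard normal survival functions $1 - \Phi$ and that $u(t,m) = (m + \sqrt{m^2 + 4t})/2$, the form given in the proof of Lemma 8; the helper names are made up for this illustration. It evaluates the tail ratio for increasing $m$ and prints it next to $e^{-t}$.

```python
import math


def u(t, m):
    """u(t, m) = (m + sqrt(m^2 + 4t)) / 2, the larger root of x * (x - m) = t."""
    return (m + math.sqrt(m * m + 4.0 * t)) / 2.0


def normal_sf(x):
    """Standard normal survival function 1 - Phi(x), computed with erfc
    so that small tail probabilities keep their relative accuracy."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def tail_ratio(t, m):
    """Ratio (1 - Phi(u(t, m))) / (1 - Phi(m)) studied in the lemma."""
    return normal_sf(u(t, m)) / normal_sf(m)


t = 1.0
for m in (2.0, 5.0, 10.0, 30.0):
    print(f"m = {m:5.1f}: ratio = {tail_ratio(t, m):.5f}, exp(-t) = {math.exp(-t):.5f}")
```

Values of $m$ much beyond 35 or so would underflow in double precision, so the sketch stops at $m = 30$.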

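A.5 Proof of Lemma 6

Fix $\epsilon > 0$, and choose $m_0$ large enough that

$$\left| \frac{\bar\Phi(u(t, m/\sigma))}{\bar\Phi(m/\sigma)} - e^{-t} \right| < \epsilon \quad \text{for all } m > m_0,$$

where $\bar\Phi = 1 - \Phi$ denotes the standard normal survival function. Starting from (22),

$$\big| P(T_1 > t) - e^{-t} \big| < \sum_{j_1,s_1} \int_0^\infty \left| \frac{\bar\Phi(u(t, m/\sigma))}{\bar\Phi(m/\sigma)} - e^{-t} \right| \bar\Phi(m/\sigma)\, F_{M(j_1,s_1)}(dm)$$

$$< \epsilon \sum_{j_1,s_1} \int_{m_0}^\infty \bar\Phi(m/\sigma)\, F_{M(j_1,s_1)}(dm) + \sum_{j_1,s_1} \int_0^{m_0} F_{M(j_1,s_1)}(dm)$$

$$< \epsilon \sum_{j_1,s_1} P\big( g(j_1,s_1) > M(j_1,s_1) \big) + \sum_{j_1,s_1} P\big( M(j_1,s_1) < m_0 \big).$$

Above, the term multiplying $\epsilon$ is equal to 1, and the second term can be made arbitrarily small (say,
less than $\epsilon$) by taking $p$ sufficiently large. □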
A.6 Proof of Theorem 2

We will show that for any fixed $m_0 > 0$ and $j_1, s_1$,

$$P\big( M(j_1,s_1) < m_0 \big) < c^{|S|}, \tag{56}$$

where $S \subseteq \{1, \ldots p\} \setminus \{j_1\}$ is as in the theorem for $j = j_1$, with size $|S| > d_p$, and $c < 1$ is a constant
(not depending on $j_1$). This would imply that

$$\sum_{j_1,s_1} P\big( M(j_1,s_1) < m_0 \big) < 2p \cdot c^{d_p} \to 0 \quad \text{as } p \to \infty,$$

where we used the fact that $d_p / \log p \to \infty$ by (25). The above sum tending to zero now implies the
desired convergence result by Lemma 6, and hence it suffices to show (56). To this end, consider

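$$M(j_1,s_1) = \max_{j \neq j_1,\, s} \frac{s U_j - s R_{j,j_1} U_{j_1}}{1 - s s_1 R_{j,j_1}} > \max_{j \in S} \frac{|U_j - R_{j,j_1} U_{j_1}|}{1 + |R_{j,j_1}|} > \max_{j \in S} \frac{|U_j - R_{j,j_1} U_{j_1}|}{2},$$

where in both inequalities above we used the fact that $|R_{j,j_1}| < 1$. We can therefore use the bound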

$$P\big( M(j_1,s_1) < m_0 \big) < P\big( |V_j| < m_0,\ j \in S \big),$$

where we define $V_j = (U_j - R_{j,j_1} U_{j_1})/2$ for $j \in S$. Let $r = |S|$, and without loss of generality, let
$S = \{1, \ldots r\}$. We will show that

$$P\big( |V_1| < m_0, \ldots |V_r| < m_0 \big) < c^r, \tag{57}$$

for $c = \Phi(2m_0/(\sigma\delta)) - \Phi(-2m_0/(\sigma\delta)) < 1$, by induction; this would complete the proof, as it would
imply (56). Before presenting this argument, we note a few important facts. First, the condition in
(24) is really a statement about conditional variances:

$$\mathrm{Var}\big( U_i \,|\, U_\ell,\ \ell \in S \setminus \{i\} \big) = \sigma^2 \cdot \big[ 1 - R_{i,S\setminus\{i\}} (R_{S\setminus\{i\},S\setminus\{i\}})^{-1} R_{S\setminus\{i\},i} \big] > \sigma^2\delta^2 \quad \text{for all } i \in S,$$

where recall that $U_j = X_j^T y$, $j = 1, \ldots p$. Second, since $U_1, \ldots U_r$ are jointly normal, we have

$$\mathrm{Var}\big( U_i \,|\, U_\ell,\ \ell \in S' \big) > \mathrm{Var}\big( U_i \,|\, U_\ell,\ \ell \in S \setminus \{i\} \big) > \sigma^2\delta^2 \quad \text{for any } S' \subseteq S \setminus \{i\}, \text{ and } i \in S, \tag{58}$$

which can be verified using the conditional variance formula (i.e., the law of total variance). Finally,
the collection $V_1, \ldots V_r$ is independent of $U_{j_1}$, because these random variables are jointly normal,
and $\mathrm{E}[V_j U_{j_1}] = 0$ for all $j = 1, \ldots r$.

Now we give the inductive argument for (57). For the base case, note that $V_1 \sim N(0, \tau_1^2)$, where
its variance is

$$\tau_1^2 = \mathrm{Var}(V_1) = \mathrm{Var}(V_1 \,|\, U_{j_1}) = \mathrm{Var}(U_1)/4 > \sigma^2\delta^2/4;$$

the second equality is due to the independence of $V_1$ and $U_{j_1}$, and the last inequality comes from
the fact that conditioning can only decrease the variance, as stated above in (58). Hence

$$P\big( |V_1| < m_0 \big) = \Phi(m_0/\tau_1) - \Phi(-m_0/\tau_1) < \Phi(2m_0/(\sigma\delta)) - \Phi(-2m_0/(\sigma\delta)) = c.$$

Assume as the inductive hypothesis that $P(|V_1| < m_0, \ldots |V_q| < m_0) < c^q$. Then

$$P\big( |V_1| < m_0, \ldots |V_{q+1}| < m_0 \big) \le P\big( |V_{q+1}| < m_0 \,\big|\, |V_1| < m_0, \ldots |V_q| < m_0 \big) \cdot c^q.$$
We have, using the independence of $V_1, \ldots V_{q+1}$ and $U_{j_1}$,

$$V_{q+1} \,|\, V_1, \ldots V_q \ \overset{d}{=}\ V_{q+1} \,|\, V_1, \ldots V_q, U_{j_1} \ \overset{d}{=}\ V_{q+1} \,|\, U_1, \ldots U_q, U_{j_1} \ \overset{d}{=}\ N(0, \tau_{q+1}^2),$$

where the variance is

$$\tau_{q+1}^2 = \mathrm{Var}\big( V_{q+1} \,|\, U_1, \ldots U_q, U_{j_1} \big) = \mathrm{Var}\big( U_{q+1} \,|\, U_1, \ldots U_q \big)/4 > \sigma^2\delta^2/4,$$

and here we again used the fact that conditioning further can only reduce the variance, as in (58).
Therefore

$$P\big( |V_{q+1}| < m_0 \,\big|\, V_1, \ldots V_q \big) < \Phi(2m_0/(\sigma\delta)) - \Phi(-2m_0/(\sigma\delta)) = c,$$

and so

$$P\big( |V_1| < m_0, \ldots |V_{q+1}| < m_0 \big) < c \cdot c^q = c^{q+1},$$

completing the inductive step. □
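
To see how the bound $2p \cdot c^{d_p}$ behaves, it can help to plug in numbers. The sketch below is only an illustration under made-up values: $m_0 = 0.5$, $\sigma = \delta = 1$, and the hypothetical growth rate $d_p = (\log p)^2$, chosen so that $d_p/\log p \to \infty$. None of these values come from the theorem, and the function names are invented for this example.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function Phi(x)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def c_constant(m0, sigma, delta):
    """c = Phi(2 m0 / (sigma delta)) - Phi(-2 m0 / (sigma delta))."""
    z = 2.0 * m0 / (sigma * delta)
    return normal_cdf(z) - normal_cdf(-z)


def union_bound(p, d_p, m0, sigma, delta):
    """The bound 2p * c^{d_p} on the sum of P(M(j1, s1) < m0) over j1, s1."""
    return 2.0 * p * c_constant(m0, sigma, delta) ** d_p


# Hypothetical values, chosen only for illustration.
m0, sigma, delta = 0.5, 1.0, 1.0
for p in (10**2, 10**4, 10**6):
    d_p = math.log(p) ** 2  # assumed growth rate with d_p / log p -> infinity
    print(f"p = {p:>7}: d_p = {d_p:7.1f}, bound = {union_bound(p, d_p, m0, sigma, delta):.3e}")
```

For larger $m_0$ the constant $c$ is closer to 1, and the bound only starts to shrink at much larger $p$; the statement is asymptotic in $p$ for each fixed $m_0$.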

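A.7 Proof of Lemma 7

Notice that

$$g(j_k,s_k) > g(j,s) \iff g(j_k,s_k) \big( 1 - \Sigma_{j_k,j}/\Sigma_{j_k,j_k} \big) > g(j,s) - (\Sigma_{j_k,j}/\Sigma_{j_k,j_k})\, g(j_k,s_k).$$

We now handle division by $1 - \Sigma_{j_k,j}/\Sigma_{j_k,j_k}$ in three cases:

• if $1 - \Sigma_{j_k,j}/\Sigma_{j_k,j_k} > 0$, then

$$g(j_k,s_k) > g(j,s) \iff g(j_k,s_k) > \frac{g(j,s) - (\Sigma_{j_k,j}/\Sigma_{j_k,j_k})\, g(j_k,s_k)}{1 - \Sigma_{j_k,j}/\Sigma_{j_k,j_k}},$$

• if $1 - \Sigma_{j_k,j}/\Sigma_{j_k,j_k} < 0$, then

$$g(j_k,s_k) > g(j,s) \iff g(j_k,s_k) < \frac{g(j,s) - (\Sigma_{j_k,j}/\Sigma_{j_k,j_k})\, g(j_k,s_k)}{1 - \Sigma_{j_k,j}/\Sigma_{j_k,j_k}},$$

• if $1 - \Sigma_{j_k,j}/\Sigma_{j_k,j_k} = 0$, then

$$g(j_k,s_k) > g(j,s) \iff 0 > g(j,s) - (\Sigma_{j_k,j}/\Sigma_{j_k,j_k})\, g(j_k,s_k).$$

Taking the intersection of these statements over all $j, s$ then gives the result in the lemma. □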
A.8 Proof of Lemma 8

Define $\sigma_k = \sigma/\sqrt{C(j_k,s_k)}$ and $u(a,b) = (b + \sqrt{b^2 + 4a})/2$. Exactly as before (dropping for simplicity
the notational dependence of $g, M^+$ on $j_k, s_k$),

$$g(g - M^+)/\sigma_k^2 > t,\ g > M^+ \iff g/\sigma_k > u(t, M^+/\sigma_k).$$

Therefore we can rewrite (1551) as 



$$P(\widetilde{T}_k > t) = P\big( g(j_k,s_k)/\sigma_k > u(t, M^+(j_k,s_k)/\sigma_k),\ g(j_k,s_k) < M^-(j_k,s_k),\ 0 > M^0(j_k,s_k) \big).$$
Note that we can upper bound the right-hand side above by replacing $g(j_k,s_k) < M^-(j_k,s_k)$ with

$$g(j_k,s_k) < M^-(j_k,s_k) + u(t\sigma_k^2, M^+(j_k,s_k)) - M^+(j_k,s_k),$$

because $u(a,b) > b$ for all $a > 0$ and $b$. Furthermore, Lemma 10 (Appendix A.11) shows that indeed
$\sigma_k^2 = \sigma^2/C(j_k,s_k) = \mathrm{Var}(g(j_k,s_k))$ for fixed $j_k, s_k$, and hence $g(j_k,s_k)/\sigma_k$ is standard normal for
fixed $j_k, s_k$. Therefore



$$P(\widetilde{T}_k > t) < \sum_{j_k,s_k} \int \Big[ \Phi\big( m^-/\sigma_k + u(t, m^+/\sigma_k) - m^+/\sigma_k \big) - \Phi\big( u(t, m^+/\sigma_k) \big) \Big]\, G_{j_k,s_k}(dm^+, dm^-, dm^0), \tag{59}$$

where

$$G_{j_k,s_k}(dm^+, dm^-, dm^0) = 1\{ m^+ < m^-,\ m^0 < 0 \} \cdot F_{M^+(j_k,s_k), M^-(j_k,s_k), M^0(j_k,s_k)}(dm^+, dm^-, dm^0),$$

with $F_{M^+(j_k,s_k), M^-(j_k,s_k), M^0(j_k,s_k)}$ the joint distribution of $M^+(j_k,s_k)$, $M^-(j_k,s_k)$, $M^0(j_k,s_k)$, and
we used the fact that $g$ is independent of $M^+, M^-, M^0$ for fixed $j_k, s_k$. From (59),

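$$P(\widetilde{T}_k > t) - e^{-t} < \sum_{j_k,s_k} \int \left( \frac{\Phi\big( m^-/\sigma_k + u(t, m^+/\sigma_k) - m^+/\sigma_k \big) - \Phi\big( u(t, m^+/\sigma_k) \big)}{\Phi(m^-/\sigma_k) - \Phi(m^+/\sigma_k)} - e^{-t} \right) \cdot \big[ \Phi(m^-/\sigma_k) - \Phi(m^+/\sigma_k) \big]\, G_{j_k,s_k}(dm^+, dm^-, dm^0), \tag{60}$$

where we here used the fact that

$$\sum_{j_k,s_k} \int \big[ \Phi(m^-/\sigma_k) - \Phi(m^+/\sigma_k) \big]\, G_{j_k,s_k}(dm^+, dm^-, dm^0) = \sum_{j_k,s_k} P\big( g(j_k,s_k) > M^+(j_k,s_k),\ g(j_k,s_k) < M^-(j_k,s_k),\ 0 > M^0(j_k,s_k) \big) = 1,$$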

because by Lemma 7 each term in the above sum is exactly the probability of $j_k, s_k$ maximizing $g$.
We show in Lemma 11 (Appendix A.12) that

$$\lim_{m^+ \to \infty} \frac{\Phi\big( m^- + u(t, m^+) - m^+ \big) - \Phi\big( u(t, m^+) \big)}{\Phi(m^-) - \Phi(m^+)} \le e^{-t},$$

provided that $m^- > m^+$. Hence fix $\epsilon > 0$, and choose $m_0$ sufficiently large, so that for each $k$,

$$\frac{\Phi\big( m^-/\sigma_k + u(t, m^+/\sigma_k) - m^+/\sigma_k \big) - \Phi\big( u(t, m^+/\sigma_k) \big)}{\Phi(m^-/\sigma_k) - \Phi(m^+/\sigma_k)} - e^{-t} < \epsilon,$$

for all $m^-/\sigma_k > m^+/\sigma_k > m_0$.



Working from (60),

$$P(\widetilde{T}_k > t) - e^{-t} < \epsilon \sum_{j_k,s_k} \int_{m^+/\sigma_k > m_0} \big[ \Phi(m^-/\sigma_k) - \Phi(m^+/\sigma_k) \big]\, G_{j_k,s_k}(dm^+, dm^-, dm^0) + \sum_{j_k,s_k} \int_{m^+/\sigma_k < m_0} G_{j_k,s_k}(dm^+, dm^-, dm^0).$$

Note that the first term on the right-hand side above is $< \epsilon$, and the second term is bounded by
$\sum_{j_k,s_k} P(M^+(j_k,s_k) < m_0\sigma_k)$, which by assumption can be made arbitrarily small (smaller than,
say, $\epsilon$) by taking $p$ large enough. □
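
The equivalence used at the start of this proof is elementary, since $u(t, m)$ is the larger root of $x(x - m) = t$, and it can be spot-checked numerically. In the sketch below, $x$ stands for $g/\sigma_k$ and $m$ for $M^+/\sigma_k$; the sampling distributions, the seed, and the function names are arbitrary choices made for this illustration.

```python
import math
import random


def u(a, b):
    """u(a, b) = (b + sqrt(b^2 + 4a)) / 2, as defined in the proof."""
    return (b + math.sqrt(b * b + 4.0 * a)) / 2.0


def equivalence_holds(x, m, t):
    """Check that {x (x - m) > t and x > m} and {x > u(t, m)} agree for one triple."""
    left = (x * (x - m) > t) and (x > m)
    right = x > u(t, m)
    return left == right


rng = random.Random(0)
checks = [
    equivalence_holds(rng.gauss(0.0, 3.0), rng.gauss(0.0, 3.0), rng.uniform(0.0, 5.0))
    for _ in range(10_000)
]
print("equivalence held on all sampled triples:", all(checks))
```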
A.9 Proof of Lemma 9

For now, we reintroduce the notational dependence of the process $g$ on $A, s_A$, as this will be important.
We show in Lemma 12 (Appendix A.13) that for any fixed $j_k, s_k, j, s$,

$$\frac{g^{(A,s_A)}(j,s) - (\Sigma_{j_k,j}/\Sigma_{j_k,j_k})\, g^{(A,s_A)}(j_k,s_k)}{1 - \Sigma_{j_k,j}/\Sigma_{j_k,j_k}} = g^{(A\cup\{j_k\},\, s_{A\cup\{j_k\}})}(j,s),$$

where $\Sigma_{j_k,j} = \mathrm{E}[g^{(A,s_A)}(j_k,s_k)\, g^{(A,s_A)}(j,s)]$, as given in (28), and as usual, $s_{A\cup\{j_k\}}$ denotes the
concatenation of $s_A$ and $s_k$. According to its definition in (33), therefore,

$$M^+(j_k,s_k) = \max_{(j,s) \in S^+(j_k,s_k)} g^{(A\cup\{j_k\},\, s_{A\cup\{j_k\}})}(j,s),$$

and hence on the event $E(j_k,s_k)$, since we have $g(j_k,s_k) > M^+(j_k,s_k)$,

$$M^+(j_k,s_k) = \max_{(j,s) \in S^+(j_k,s_k)} g^{(A\cup\{j_k\},\, s_{A\cup\{j_k\}})}(j,s) \cdot 1\big\{ g^{(A\cup\{j_k\},\, s_{A\cup\{j_k\}})}(j,s) < g^{(A,s_A)}(j_k,s_k) \big\}$$

$$< \max_{j \notin A\cup\{j_k\},\, s} g^{(A\cup\{j_k\},\, s_{A\cup\{j_k\}})}(j,s) \cdot 1\big\{ g^{(A\cup\{j_k\},\, s_{A\cup\{j_k\}})}(j,s) < g^{(A,s_A)}(j_k,s_k) \big\} = M(j_k,s_k).$$

This means that (now we return to writing $g^{(A,s_A)}$ as $g$ for brevity)

$$\sum_{j_k,s_k} P\Big( \big\{ C(j_k,s_k) \cdot g(j_k,s_k) \big( g(j_k,s_k) - M(j_k,s_k) \big)/\sigma^2 > t \big\} \cap E(j_k,s_k) \Big) < \sum_{j_k,s_k} P\Big( \big\{ C(j_k,s_k) \cdot g(j_k,s_k) \big( g(j_k,s_k) - M^+(j_k,s_k) \big)/\sigma^2 > t \big\} \cap E(j_k,s_k) \Big),$$

and so $\lim_{p\to\infty} P(T_k > t) < \lim_{p\to\infty} P(\widetilde{T}_k > t) < e^{-t}$, the desired conclusion. □



A.10 Proof of Theorem 3

Since we are assuming that $P(B) \to 1$, we know that $P(T_k > t \,|\, B) - P(T_k > t) \to 0$, so we only need
to consider the marginal limiting distribution of $T_k$. We write $A = A_0$ and $s_A = s_{A_0}$. The general
idea here is similar to that used in the proof of Theorem 2. Fixing $m_0$ and $j_k, s_k$, we will show that

$$P\big( M^+(j_k,s_k) < m_0\sigma_k \big) < c^{|S|}, \tag{61}$$

where $S \subseteq \{1, \ldots p\} \setminus (A \cup \{j_k\})$ is as in the statement of the theorem for $j = j_k$, with size $|S| > d_p$,
and $c < 1$ is a constant (not depending on $j_k$). Also, as in the proof of Lemma 8 we abbreviated
$\sigma_k = \sigma/\sqrt{C(j_k,s_k)}$. This bound would imply that

$$\sum_{j_k,s_k} P\big( M^+(j_k,s_k) < m_0\sigma_k \big) < 2p \cdot c^{d_p} \to 0 \quad \text{as } p \to \infty,$$

since $d_p/\log p \to \infty$. The above sum converging to zero is precisely the condition required by Lemma
9, which then gives the desired (conservative) exponential limit for $T_k$. Hence it suffices to show
(61). For this, we start by recalling the definition of $M^+$ in (33):

$$M^+(j_k,s_k) = \max_{(j,s) \in S^+(j_k,s_k)} \frac{g(j,s) - (\Sigma_{j_k,j}/\Sigma_{j_k,j_k})\, g(j_k,s_k)}{1 - \Sigma_{j_k,j}/\Sigma_{j_k,j_k}},$$

where $S^+(j_k,s_k) = \{ (j,s) : \Sigma_{j_k,j}/\Sigma_{j_k,j_k} < 1 \}$.
Here we write $\Sigma_{j_k,j} = \mathrm{E}[g(j_k,s_k)\, g(j,s)]$; note that $\Sigma_{j_k,j_k} = \sigma_k^2$ (as shown in Lemma 10). First we
show that the conditions of the theorem actually imply that $S^+(j_k,s_k) \supseteq S \times \{-1, 1\}$. This is
true because for $j \in S$ and any $s \in \{-1, 1\}$, we have $|R_{j,j_k}|/R_{j_k,j_k} < \eta/(2 - \eta)$ by (42), and

$$\frac{|R_{j,j_k}|}{R_{j_k,j_k}} < \frac{\eta}{2 - \eta} \ \Longrightarrow\ \frac{R_{j,j_k}}{R_{j_k,j_k}} \cdot \frac{s_k - X_{j_k}^T (X_A^T)^+ s_A}{s - X_j^T (X_A^T)^+ s_A} < 1 \ \Longrightarrow\ \Sigma_{j_k,j}/\Sigma_{j_k,j_k} < 1.$$

The first implication uses the assumption (40), as $|s_k - X_{j_k}^T (X_A^T)^+ s_A| < 1 + \|(X_A)^+ X_{j_k}\|_1 < 2 - \eta$
and $|s - X_j^T (X_A^T)^+ s_A| > 1 - \|(X_A)^+ X_j\|_1 > \eta$, and the second simply follows from the definition
of $\Sigma_{j_k,j}$ and $\Sigma_{j_k,j_k}$. Therefore

$$M^+(j_k,s_k) > \max_{j \in S,\, s} \frac{g(j,s) - (\Sigma_{j_k,j}/\Sigma_{j_k,j_k})\, g(j_k,s_k)}{1 - \Sigma_{j_k,j}/\Sigma_{j_k,j_k}}.$$

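Let $U_j = X_j^T (I - P_A)\, y$ and $\theta_{j_k,j} = R_{j_k,j}/R_{j_k,j_k}$ for $j \in S$. By the arguments given in the proof of
Lemma 12, we can rewrite the right-hand side above, yielding

$$M^+(j_k,s_k) > \max_{j \in S,\, s} \frac{U_j - \theta_{j_k,j} U_{j_k}}{s - X_j^T (X_{A\cup\{j_k\}}^T)^+ s_{A\cup\{j_k\}}} = \max_{j \in S,\, s} \frac{s\,(U_j - \theta_{j_k,j} U_{j_k})}{1 - s X_j^T (X_{A\cup\{j_k\}}^T)^+ s_{A\cup\{j_k\}}}$$

$$> \max_{j \in S} \frac{|U_j - \theta_{j_k,j} U_{j_k}|}{1 + |X_j^T (X_{A\cup\{j_k\}}^T)^+ s_{A\cup\{j_k\}}|} > \max_{j \in S} \frac{|U_j - \theta_{j_k,j} U_{j_k}|}{2},$$

where the last two inequalities above follow as $|X_j^T (X_{A\cup\{j_k\}}^T)^+ s_{A\cup\{j_k\}}| < 1$ for all $j \in S$, which itself
follows from the assumption that $\|(X_{A\cup\{j_k\}})^+ X_j\|_\infty < 1$ for all $j \in S$, in (43). Hence

$$P\big( M^+(j_k,s_k) < m_0\sigma_k \big) < P\big( |V_j| < m_0\sigma_k,\ j \in S \big),$$

where $V_j = (U_j - \theta_{j_k,j} U_{j_k})/2$. Writing without loss of generality $r = |S|$ and $S = \{1, \ldots r\}$, it now
remains to show that

$$P\big( |V_1| < m_0\sigma_k, \ldots |V_r| < m_0\sigma_k \big) < c^r. \tag{62}$$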

Similar to the arguments in the proof of Theorem 2, we will show (62) by induction, for the constant
$c = \Phi(2m_0\sqrt{C}/(\delta\eta)) - \Phi(-2m_0\sqrt{C}/(\delta\eta)) < 1$. Before this, it is helpful to discuss three important
facts. First, we note that (41) is actually a lower bound on the ratio of conditional to unconditional
variances:

$$\mathrm{Var}\big( U_i \,|\, U_\ell,\ \ell \in S \setminus \{i\} \big) / \mathrm{Var}(U_i) = \big[ R_{ii} - R_{i,S\setminus\{i\}} (R_{S\setminus\{i\},S\setminus\{i\}})^{-1} R_{S\setminus\{i\},i} \big] / R_{ii} > \delta^2 \quad \text{for all } i \in S.$$

Second, conditioning on a smaller set of variables can only increase the conditional variance:

$$\mathrm{Var}\big( U_i \,|\, U_\ell,\ \ell \in S' \big) > \mathrm{Var}\big( U_i \,|\, U_\ell,\ \ell \in S \setminus \{i\} \big) > \delta^2\sigma^2 R_{ii} \quad \text{for any } S' \subseteq S \setminus \{i\}, \text{ and } i \in S,$$

which holds as $U_1, \ldots U_r$ are jointly normal. Third, and lastly, the collection $V_1, \ldots V_r$ is independent
of $U_{j_k}$, since these variables are all jointly normal, and it is easily verified that $\mathrm{E}[V_j U_{j_k}] = 0$ for each
$j = 1, \ldots r$.
We give the inductive argument for (62). For the base case, we have $V_1 \sim N(0, \tau_1^2)$, where

$$\tau_1^2 = \mathrm{Var}(V_1) = \mathrm{Var}(V_1 \,|\, U_{j_k}) = \mathrm{Var}(U_1)/4 > \delta^2\sigma^2 R_{11}/4.$$

Above, in the second equality, we used that $V_1$ and $U_{j_k}$ are independent, and in the last inequality,
that conditioning on fewer variables (here, none) only increases the variance. This means that

$$P\big( |V_1| < m_0\sigma_k \big) < P\big( |Z| < 2m_0\sigma_k/(\delta\sigma\sqrt{R_{11}}) \big) < P\big( |Z| < 2m_0\sqrt{C}/(\delta\eta) \big) = c,$$

where $Z$ is standard normal; note that in the last inequality above, we applied the upper bound

$$\frac{\sigma_k^2}{\sigma^2 R_{11}} = \frac{R_{j_k,j_k}}{R_{11}\, [s_k - X_{j_k}^T (X_A^T)^+ s_A]^2} < \frac{C}{\eta^2}.$$

Now, for the inductive hypothesis, assume that $P(|V_1| < m_0\sigma_k, \ldots |V_q| < m_0\sigma_k) < c^q$. Consider

$$P\big( |V_1| < m_0\sigma_k, \ldots |V_{q+1}| < m_0\sigma_k \big) \le P\big( |V_{q+1}| < m_0\sigma_k \,\big|\, |V_1| < m_0\sigma_k, \ldots |V_q| < m_0\sigma_k \big) \cdot c^q.$$

Using the independence of $V_1, \ldots V_{q+1}$ and $U_{j_k}$,

$$V_{q+1} \,|\, V_1, \ldots V_q \ \overset{d}{=}\ V_{q+1} \,|\, V_1, \ldots V_q, U_{j_k} \ \overset{d}{=}\ V_{q+1} \,|\, U_1, \ldots U_q, U_{j_k} \ \overset{d}{=}\ N(0, \tau_{q+1}^2).$$

The variance $\tau_{q+1}^2$ is

$$\tau_{q+1}^2 = \mathrm{Var}\big( V_{q+1} \,|\, U_1, \ldots U_q, U_{j_k} \big) = \mathrm{Var}\big( U_{q+1} \,|\, U_1, \ldots U_q \big)/4 > \delta^2\sigma^2 R_{q+1,q+1}/4,$$

where we again used the fact that conditioning on a smaller set of variables only makes the variance
larger. Finally,

$$P\big( |V_{q+1}| < m_0\sigma_k \,\big|\, V_1, \ldots V_q \big) < P\big( |Z| < 2m_0\sigma_k/(\delta\sigma\sqrt{R_{q+1,q+1}}) \big) < P\big( |Z| < 2m_0\sqrt{C}/(\delta\eta) \big) = c,$$

where we used $\sigma_k^2/(\sigma^2 R_{q+1,q+1}) < C/\eta^2$ as above, and so

$$P\big( |V_1| < m_0\sigma_k, \ldots |V_{q+1}| < m_0\sigma_k \big) < c \cdot c^q = c^{q+1}.$$

This completes the inductive proof. □

A.11 Statement and proof of Lemma 10

Lemma 10. For any fixed $A$, $s_A$, and any $j \notin A$, $s \in \{-1, 1\}$, we have

$$\mathrm{Var}(g(j,s)) = \frac{X_j^T (I - P_A) X_j\, \sigma^2}{[s - X_j^T (X_A^T)^+ s_A]^2} = \frac{\sigma^2}{\| (X_{A\cup\{j\}}^T)^+ s_{A\cup\{j\}} - (X_A^T)^+ s_A \|_2^2},$$

where $s_{A\cup\{j\}}$ denotes the concatenation of $s_A$ and $s$.

Proof. We will show that

$$\frac{[s - X_j^T (X_A^T)^+ s_A]^2}{X_j^T (I - P_A) X_j} = \big\| (X_{A\cup\{j\}}^T)^+ s_{A\cup\{j\}} - (X_A^T)^+ s_A \big\|_2^2. \tag{63}$$

The right-hand side above, after a straightforward calculation, is shown to be equal to

$$s_{A\cup\{j\}}^T (X_{A\cup\{j\}}^T X_{A\cup\{j\}})^{-1} s_{A\cup\{j\}} - s_A^T (X_A^T X_A)^{-1} s_A. \tag{64}$$

Now let $z = (X_{A\cup\{j\}}^T X_{A\cup\{j\}})^{-1} s_{A\cup\{j\}}$. In block form,

$$\begin{bmatrix} X_A^T X_A & X_A^T X_j \\ X_j^T X_A & X_j^T X_j \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} s_A \\ s \end{bmatrix}. \tag{65}$$

Solving for $z_1$ in the first row yields

$$z_1 = (X_A^T X_A)^{-1} s_A - (X_A)^+ X_j z_2,$$

and therefore (64) is equal to

$$s_A^T z_1 + s z_2 - s_A^T (X_A^T X_A)^{-1} s_A = [s - s_A^T (X_A)^+ X_j]\, z_2. \tag{66}$$

Solving for $z_2$ in the second row of (65) gives

$$z_2 = \frac{s - s_A^T (X_A)^+ X_j}{X_j^T (I - P_A) X_j}.$$

Plugging this value into (66) produces the left-hand side in (63), completing the proof. □

A.12 Statement and proof of Lemma 11

Lemma 11. If $v = v(m)$ satisfies $v > m$, then for any $t > 0$,

$$\lim_{m\to\infty} \frac{\Phi(v + u(t,m) - m) - \Phi(u(t,m))}{\Phi(v) - \Phi(m)} \le e^{-t}.$$



Proof. First note, using a Taylor series expansion of $\sqrt{1 + 4t/m^2}$, that for sufficiently large $m$,

$$u(t,m) > m + \frac{t}{m} - \frac{t^2}{m^3}. \tag{67}$$

Also, a simple calculation shows that $\partial(u(t,m) - m)/\partial m < 0$ for all $m$, so that

$$u(t,w) - w < u(t,m) - m \quad \text{for all } w > m. \tag{68}$$

Now consider

$$\Phi(v + u(t,m) - m) - \Phi(u(t,m)) = \int_{u(t,m)}^{v + u(t,m) - m} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, dz = \int_m^v \frac{e^{-(w + u(t,m) - m)^2/2}}{\sqrt{2\pi}}\, dw$$

$$< \int_m^v \frac{e^{-u(t,w)^2/2}}{\sqrt{2\pi}}\, dw < \int_m^v \frac{e^{-(w + t/w - t^2/w^3)^2/2}}{\sqrt{2\pi}}\, dw,$$

where the first inequality followed from (68), and the second from (67) (assuming $m$ is large enough).
Continuing from the last upper bound,

$$\int_m^v \frac{e^{-(w + t/w - t^2/w^3)^2/2}}{\sqrt{2\pi}}\, dw = e^{-t} \int_m^v \frac{e^{-w^2/2}}{\sqrt{2\pi}}\, f(w,t)\, dw,$$

where

$$f(w,t) = \exp\left( \frac{t^2}{2w^2} + \frac{t^3}{w^4} - \frac{t^4}{2w^6} \right).$$

Therefore, we have

$$\frac{\Phi(v + u(t,m) - m) - \Phi(u(t,m))}{\Phi(v) - \Phi(m)} - e^{-t} < e^{-t} \left( \frac{\int_m^v \frac{e^{-w^2/2}}{\sqrt{2\pi}}\, f(w,t)\, dw}{\int_m^v \frac{e^{-w^2/2}}{\sqrt{2\pi}}\, dw} - 1 \right). \tag{69}$$

It is clear that $f(w,t) \to 1$ as $w \to \infty$. Fixing $\epsilon$, choose $m_0$ large enough so that for all $w > m_0$, we
have $|f(w,t) - 1| < \epsilon$. Then the term multiplying $e^{-t}$ on the right-hand side in (69), for $m > m_0$, is $< \epsilon$,
which shows that the right-hand side in (69) is $< \epsilon \cdot e^{-t} < \epsilon$, and completes the proof. □
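
The two elementary facts about $u(t,m) = (m + \sqrt{m^2 + 4t})/2$ used in this proof, the lower bound (67) and the monotonicity behind (68), can be checked on a grid. The sketch below is a numerical illustration only; the value of $t$, the grid of $m$ values, and the function names are arbitrary choices for this example.

```python
import math


def u(t, m):
    """u(t, m) = (m + sqrt(m^2 + 4t)) / 2."""
    return (m + math.sqrt(m * m + 4.0 * t)) / 2.0


def taylor_lower_bound(t, m):
    """Right-hand side of (67): m + t/m - t^2/m^3."""
    return m + t / m - t ** 2 / m ** 3


t = 1.5
grid = [1.0 + 0.5 * i for i in range(60)]  # m = 1.0, 1.5, ..., 30.5

# (67): u(t, m) is at least the Taylor lower bound at every grid point.
bound_holds = all(u(t, m) >= taylor_lower_bound(t, m) for m in grid)

# (68): the gap u(t, m) - m decreases as m increases along the grid.
gaps = [u(t, m) - m for m in grid]
gap_decreasing = all(a > b for a, b in zip(gaps, gaps[1:]))

print("lower bound (67) holds on the grid:", bound_holds)
print("u(t, m) - m is decreasing on the grid:", gap_decreasing)
```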



A.13 Statement and proof of Lemma 12

Lemma 12. For any fixed $j_k, s_k, j, s$ (and fixed $A$, $s_A$), we have

$$\frac{g^{(A,s_A)}(j,s) - (\Sigma_{j_k,j}/\Sigma_{j_k,j_k})\, g^{(A,s_A)}(j_k,s_k)}{1 - \Sigma_{j_k,j}/\Sigma_{j_k,j_k}} = g^{(A\cup\{j_k\},\, s_{A\cup\{j_k\}})}(j,s), \tag{70}$$

where $\Sigma_{j_k,j}$ denotes the covariance between $g^{(A,s_A)}(j_k,s_k)$ and $g^{(A,s_A)}(j,s)$,

$$\Sigma_{j_k,j} = \frac{X_{j_k}^T (I - P_A) X_j\, \sigma^2}{[s_k - s_A^T (X_A)^+ X_{j_k}]\,[s - s_A^T (X_A)^+ X_j]}.$$

Proof. Simple manipulations of the left-hand side in (70) yield the expression

$$\frac{X_j^T (I - P_A)\, y - \theta_{j_k,j} \cdot X_{j_k}^T (I - P_A)\, y}{s - s_A^T (X_A)^+ X_j - \theta_{j_k,j} \cdot [s_k - s_A^T (X_A)^+ X_{j_k}]}, \tag{71}$$

where $\theta_{j_k,j} = X_{j_k}^T (I - P_A) X_j / (X_{j_k}^T (I - P_A) X_{j_k})$. Now it remains to show that (71) is equal to

$$\frac{X_j^T (I - P_{A\cup\{j_k\}})\, y}{s - s_{A\cup\{j_k\}}^T (X_{A\cup\{j_k\}})^+ X_j}. \tag{72}$$

We show individually that the numerators and denominators in (71) and (72) are equal. First the
denominators: starting with (71), notice that

$$s - s_A^T (X_A)^+ X_j - \theta_{j_k,j} [s_k - s_A^T (X_A)^+ X_{j_k}] = s - s_{A\cup\{j_k\}}^T \begin{bmatrix} (X_A)^+ (X_j - \theta_{j_k,j} X_{j_k}) \\ \theta_{j_k,j} \end{bmatrix}. \tag{73}$$

By the well-known formula for partial regression coefficients,

$$\theta_{j_k,j} = \frac{X_{j_k}^T (I - P_A) X_j}{X_{j_k}^T (I - P_A) X_{j_k}} = \big[ (X_{A\cup\{j_k\}})^+ X_j \big]_{j_k},$$

i.e., $\theta_{j_k,j}$ is the ($j_k$)th coefficient in the regression of $X_j$ on $X_{A\cup\{j_k\}}$. Hence to show that (73) is equal
to the denominator in (72), we need to show that $(X_A)^+ (X_j - \theta_{j_k,j} X_{j_k})$ gives the coefficients in $A$ in
the regression of $X_j$ on $X_{A\cup\{j_k\}}$. This follows by simply noting that the coefficients $(X_{A\cup\{j_k\}})^+ X_j =
(\theta_{A,j}, \theta_{j_k,j})$ satisfy the equation

$$X_A \theta_{A,j} + X_{j_k} \theta_{j_k,j} = P_{A\cup\{j_k\}} X_j,$$

and so solving for $\theta_{A,j}$,

$$\theta_{A,j} = (X_A)^+ (P_{A\cup\{j_k\}} X_j - \theta_{j_k,j} X_{j_k}) = (X_A)^+ (X_j - \theta_{j_k,j} X_{j_k}).$$

Now for the numerators: again beginning with (71), its numerator is

$$y^T (I - P_A)(X_j - \theta_{j_k,j} X_{j_k}), \tag{74}$$

and by essentially the same argument as above, we have

$$P_A (X_j - \theta_{j_k,j} X_{j_k}) = P_{A\cup\{j_k\}} X_j - \theta_{j_k,j} X_{j_k},$$

so that $(I - P_A)(X_j - \theta_{j_k,j} X_{j_k}) = (I - P_{A\cup\{j_k\}}) X_j$; therefore (74) matches the numerator in (72). □